# SEN163A - Fundamentals of Data Analytics
## Assignment 1 - Fraud detection
### Dr. Ir. Jacopo De Stefani - [J.deStefani@tudelft.nl](mailto:J.deStefani@tudelft.nl)
### Joao Pizani Flor, M.Sc. - [J.p.pizaniflor@tudelft.nl](mailto:J.p.pizaniflor@tudelft.nl)

# General imports

In [1]:
import sqlite3
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
# Read data directly from SQLITE DB

In [3]:
# Open the sqlite3 connection
connection = sqlite3.connect('./transaction_data.db')

# Load the whole table into a DataFrame
df = pd.read_sql_query("SELECT * FROM transaction_data;", connection)

# Close the connection (no commit is needed: the query only reads data)
connection.close()
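
As a hedged aside, the connection handling can also be written so that the connection is closed even if the query raises. The sketch below assumes a toy in-memory database instead of the real `transaction_data.db`; the two-column schema and the rows are made up for illustration.

```python
import sqlite3
from contextlib import closing

# closing() calls connection.close() on exit, even if a query raises.
with closing(sqlite3.connect(":memory:")) as connection:
    # Toy table standing in for the real one (hypothetical schema and rows).
    connection.execute("CREATE TABLE transaction_data (id INTEGER, amount TEXT);")
    connection.executemany(
        "INSERT INTO transaction_data VALUES (?, ?);",
        [(1, "0.01"), (2, "9839.64")],
    )
    cursor = connection.execute("SELECT * FROM transaction_data;")
    columns = [description[0] for description in cursor.description]
    rows = cursor.fetchall()

print(columns)
print(rows)
```

Inside the `with` block, the same `connection` object could be passed to `pd.read_sql_query` instead of using a cursor.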


In [4]:
df.head()

Unnamed: 0,id,timestamp,type,amount,nameOrig,oldbalanceOrig,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest
0,1,1,TRANSFER,0.01,C1231006815,170136.0,170135.99,C52983754,0.01,0.02
1,2,1,TRANSFER,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,9839.63
2,3,1,TRANSFER,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,1864.28
3,4,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,181.0
4,5,1,TRANSFER,181.0,C840083671,181.0,0.0,C38997010,21182.0,21363.0


In [5]:
df.shape

(7734834, 10)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7734834 entries, 0 to 7734833
Data columns (total 10 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   id              int64 
 1   timestamp       int64 
 2   type            object
 3   amount          object
 4   nameOrig        object
 5   oldbalanceOrig  object
 6   newbalanceOrig  object
 7   nameDest        object
 8   oldbalanceDest  object
 9   newbalanceDest  object
dtypes: int64(2), object(8)
memory usage: 590.1+ MB


In [7]:
df = df.astype({
    "amount": float,
    "oldbalanceOrig": float,
    "newbalanceOrig": float,
    "oldbalanceDest": float,
    "newbalanceDest": float,
})

In [8]:
df['timestamp'] = df.timestamp.astype('category')
df['nameOrig'] = df.nameOrig.astype('category')
df['nameDest'] = df.nameDest.astype('category')

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7734834 entries, 0 to 7734833
Data columns (total 10 columns):
 #   Column          Dtype   
---  ------          -----   
 0   id              int64   
 1   timestamp       category
 2   type            object  
 3   amount          float64 
 4   nameOrig        category
 5   oldbalanceOrig  float64 
 6   newbalanceOrig  float64 
 7   nameDest        category
 8   oldbalanceDest  float64 
 9   newbalanceDest  float64 
dtypes: category(3), float64(5), int64(1), object(1)
memory usage: 749.6+ MB


## What happens in a regular transaction?

- The money gets removed from the **Origin** $\rightarrow new_{\text{Orig}} = old_{\text{Orig}} - amount$ or $old_{\text{Orig}} - new_{\text{Orig}} = amount$
    - Is this always true?
- The money gets added to the **Destination** $\rightarrow new_{\text{Dest}} = old_{\text{Dest}} + amount$ or $new_{\text{Dest}} - old_{\text{Dest}} = amount$
    - Is this always true?
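
The two balance equations above can be sketched as plain functions that return a residual, which is zero when the equation holds. The function names are hypothetical, and the figures are copied from the rows shown by `df.head()`.

```python
def origin_residual(old_orig, new_orig, amount):
    """(old_Orig - new_Orig) - amount: zero if the origin equation holds."""
    return (old_orig - new_orig) - amount


def destination_residual(old_dest, new_dest, amount):
    """(new_Dest - old_Dest) - amount: zero if the destination equation holds."""
    return (new_dest - old_dest) - amount


# Row with id 4: both equations hold exactly.
print(origin_residual(181.0, 0.0, 181.0))
print(destination_residual(0.0, 181.0, 181.0))

# Row with id 2: the destination receives 9839.63 for an amount of 9839.64,
# so the destination residual is about minus one cent.
print(destination_residual(0.0, 9839.63, 9839.64))
```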

In [10]:
n_transactions = df.shape[0]

### Let's check on the first line

In [11]:
first_line = df.head(1)
first_line

Unnamed: 0,id,timestamp,type,amount,nameOrig,oldbalanceOrig,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest
0,1,1,TRANSFER,0.01,C1231006815,170136.0,170135.99,C52983754,0.01,0.02


#### First approach
**Wrong** - Testing floats for exact equality can return `False` even when the values are mathematically equal, because most decimal fractions have no exact binary floating-point representation.

In [12]:
(first_line['oldbalanceOrig'] - first_line['newbalanceOrig']) == first_line['amount']

0    False
dtype: bool

#### Second approach
**Wrong** - Checking whether the difference is exactly zero has the same problem: floating-point rounding leaves a tiny non-zero residual even when the values are mathematically equal.

In [13]:
((first_line['oldbalanceOrig'] - first_line['newbalanceOrig']) - first_line['amount']) == 0

0    False
dtype: bool

If we check the difference, it is very close to 0 (about $9.3 \times 10^{-12}$) but not exactly zero.

In [14]:
((first_line['oldbalanceOrig'] - first_line['newbalanceOrig']) - first_line['amount'])

0    9.313226e-12
dtype: float64
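
To see where this residual comes from, here is a small sketch using only the standard-library `decimal` module. It reuses the figures of the first row; the variable names are chosen for illustration.

```python
from decimal import Decimal

old_balance, new_balance, amount = 170136.0, 170135.99, 0.01

# Decimal(float) shows the exact binary value stored for each literal.
print(Decimal(new_balance))
print(Decimal(amount))

# Binary floating point: the residual is tiny but not zero.
float_residual = (old_balance - new_balance) - amount

# Exact decimal arithmetic on the same figures, written as strings.
decimal_residual = Decimal("170136.0") - Decimal("170135.99") - Decimal("0.01")

print(float_residual)
print(decimal_residual == 0)
```

Neither `170135.99` nor `0.01` is stored exactly as a float, which is why the float residual is non-zero while the decimal one is exactly zero.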

#### Third approach
**Correct?** - Checking whether the absolute difference of the values is smaller than a certain threshold (`epsilon`) might produce good results.

In [15]:
epsilon = 0.1

In [16]:
((first_line['oldbalanceOrig'] - first_line['newbalanceOrig']) - first_line['amount']).abs() < epsilon

0    True
dtype: bool
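
A minimal standard-library sketch of a tolerance-based comparison, assuming a two-sided (absolute) test. The helper name `balances_match`, the value `epsilon = 0.001`, and the second row are hypothetical. The point it illustrates: a one-sided test `residual < epsilon` also accepts a large negative residual, so the comparison should be made on the absolute value.

```python
import math


def balances_match(old_balance, new_balance, amount, epsilon):
    """True if (old - new) equals amount up to an absolute tolerance."""
    return math.isclose(old_balance - new_balance, amount,
                        rel_tol=0.0, abs_tol=epsilon)


epsilon = 0.001  # illustrative value only

# First row of the dataset: consistent up to rounding error.
first_ok = balances_match(170136.0, 170135.99, 0.01, epsilon)

# Made-up inconsistent row: 100 leaves the account, the amount claims 150.
residual = (1000.0 - 900.0) - 150.0      # -50.0
one_sided = residual < epsilon           # passes, although the row is off by 50
two_sided = balances_match(1000.0, 900.0, 150.0, epsilon)

print(first_ok, one_sided, two_sided)
```

`rel_tol=0.0` is set explicitly because the default relative tolerance of `math.isclose` would otherwise also take part in the comparison.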

Let's look at the distribution of the differences between `df['oldbalanceOrig'] - df['newbalanceOrig']` and `df['amount']` over the whole dataset.

In [17]:
difference_series = (df['oldbalanceOrig'] - df['newbalanceOrig']) - df['amount']

We define a set of bins, and using groupby, we check how many values belong to each bin.

In [18]:
decimal_bins = [-0.0005,0.0,0.0005,0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0, 50.0, 100.0]

In [19]:
difference_series.groupby(pd.cut(difference_series,decimal_bins)).count()

(-0.0005, 0.0]     4598752
(0.0, 0.0005]      1136096
(0.0005, 0.001]          0
(0.001, 0.005]      695387
(0.005, 0.01]      1304599
(0.01, 0.05]             0
(0.05, 0.1]              0
(0.1, 0.5]               0
(0.5, 1.0]               0
(1.0, 5.0]               0
(5.0, 10.0]              0
(10.0, 50.0]             0
(50.0, 100.0]            0
dtype: int64

The majority of these differences fall in the bin `(-0.0005, 0.0]`, which contains the values equal to zero. The four non-empty bins add up to 7,734,834, the total number of rows, so every difference lies within `(-0.0005, 0.01]`. All the non-zero differences are therefore at most 0.01 (1 cent), hence the value of `epsilon` used for the comparisons needs to be smaller than 0.01.
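
The binning step can be sketched on a toy Series, using `value_counts` on the result of `pd.cut` as one possible alternative to the `groupby` call. The residuals below are made up, not taken from the dataset. The sketch also counts the values that fall outside every bin, since those are silently left out of the per-bin totals.

```python
import pandas as pd

# Made-up residuals, not taken from the dataset.
differences = pd.Series([0.0, 0.0, -0.0001, 0.0002, 0.003, 0.01, 0.25])
bins = [-0.0005, 0.0, 0.0005, 0.001, 0.005, 0.01]

# Each value is assigned to a right-closed interval (left, right].
binned = pd.cut(differences, bins)

# sort=False keeps the bins in interval order and includes empty ones.
counts = binned.value_counts(sort=False)

# Values outside every bin become NaN and are dropped from the counts.
n_outside = int(binned.isna().sum())

print(counts)
print(n_outside)
```

Comparing `counts.sum() + n_outside` with the length of the Series is a quick way to confirm that the chosen bins account for every value.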