#### **Exploratory Data Analysis**

Here, we import the dataset and perform some
data manipulation and visualization to get familiar
with the data at hand before proceeding to the
data engineering step.

In [50]:
import pandas as pd
import numpy as np
import plotly.express as px

# Raw string (r'...') so the backslashes in the Windows path are not read as escape sequences
data = pd.read_csv(r'B:\_GITHUB\Data-Science-Projects\Online Payments Fraud Detection\Dataset\data.csv')

In [36]:
print(data.head())

   step      type    amount     nameOrig  oldbalanceOrg  newbalanceOrig  \
0     1   PAYMENT   9839.64  C1231006815       170136.0       160296.36   
1     1   PAYMENT   1864.28  C1666544295        21249.0        19384.72   
2     1  TRANSFER    181.00  C1305486145          181.0            0.00   
3     1  CASH_OUT    181.00   C840083671          181.0            0.00   
4     1   PAYMENT  11668.14  C2048537720        41554.0        29885.86   

      nameDest  oldbalanceDest  newbalanceDest  isFraud  isFlaggedFraud  
0  M1979787155             0.0             0.0        0               0  
1  M2044282225             0.0             0.0        0               0  
2   C553264065             0.0             0.0        1               0  
3    C38997010         21182.0             0.0        1               0  
4  M1230701703             0.0             0.0        0               0  


In [37]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6362620 entries, 0 to 6362619
Data columns (total 11 columns):
 #   Column          Dtype  
---  ------          -----  
 0   step            int64  
 1   type            object 
 2   amount          float64
 3   nameOrig        object 
 4   oldbalanceOrg   float64
 5   newbalanceOrig  float64
 6   nameDest        object 
 7   oldbalanceDest  float64
 8   newbalanceDest  float64
 9   isFraud         int64  
 10  isFlaggedFraud  int64  
dtypes: float64(5), int64(3), object(3)
memory usage: 534.0+ MB


In [38]:
print(data.isnull().sum())

step              0
type              0
amount            0
nameOrig          0
oldbalanceOrg     0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64


In [39]:
print(data.type.value_counts())

CASH_OUT    2237500
PAYMENT     2151495
CASH_IN     1399284
TRANSFER     532909
DEBIT         41432
Name: type, dtype: int64
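
The raw counts are easier to compare as shares of the total. The following is a minimal sketch that rebuilds the counts printed above as a small Series; the names `type_counts` and `type_share` are introduced here for illustration. On the full dataset, `data['type'].value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above
type_counts = pd.Series(
    {"CASH_OUT": 2237500, "PAYMENT": 2151495, "CASH_IN": 1399284,
     "TRANSFER": 532909, "DEBIT": 41432},
    name="type",
)

total = type_counts.sum()          # equals the number of rows in the dataset
type_share = type_counts / total   # fraction of all transactions per type

print(total)
print((type_share * 100).round(2))
```

Together, CASH_OUT and PAYMENT make up about 69% of all transactions, while DEBIT accounts for less than 1%.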


#### **Data Distribution Visualization**

In [40]:
data_type = data['type'].value_counts()
labels = data_type.index

fig = px.pie(data, values = data_type, names = labels, hole = 0.5)
fig.update_layout(title = "Transaction types distribution", title_x = 0.5, height = 400, width = 800)
fig.show()

#### **Correlation Data**
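We check the correlation between the columns. Correlation measures the strength of the mutual relationship between two variables, whether or not that relationship is causal. A measure of association can be computed for any pairing of data types (continuous and continuous, categorical and categorical, continuous and categorical), although the appropriate coefficient differs by pairing. `DataFrame.corr()` computes the Pearson coefficient by default and applies to numeric columns.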

In [41]:
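# Restrict to numeric columns: pandas 2.0+ raises an error on string columns such as 'type' or 'nameOrig'
data_corr = data.select_dtypes(include='number').corr()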
for col in data_corr.columns:
    d = data_corr[col]
    data_ind = d.index
    data_val = d.values
    fig = px.bar(d, x = data_ind, y = data_val, color = col)
    fig.update_layout(title = f"Correlation between \"{col}\" and others", title_x = 0.5 ,width = 700, height = 400, xaxis_title = 'Columns',
    yaxis_title = 'Value')
    fig.show()
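
When the main question is which features move with the target, a single sorted column can be easier to read than one chart per feature. The following is a minimal sketch on a small made-up frame (the values are illustrative and not taken from the dataset; `toy` and `corr_with_target` are names introduced here). Because `isFraud` is a 0/1 column at this point, its Pearson correlation with a numeric feature is the point-biserial correlation.

```python
import pandas as pd

# Toy stand-in for `data` (all values are made up for illustration)
toy = pd.DataFrame({
    "type":           ["PAYMENT", "PAYMENT", "TRANSFER", "DEBIT", "CASH_OUT", "PAYMENT"],
    "amount":         [100.0, 250.0, 9000.0, 40.0, 12000.0, 75.0],
    "oldbalanceOrg":  [500.0, 300.0, 9000.0, 90.0, 12000.0, 80.0],
    "isFlaggedFraud": [0, 0, 0, 0, 1, 0],
    "isFraud":        [0, 0, 1, 0, 1, 0],
})

# Keep numeric columns only, then take the column of correlations with the target
corr_with_target = (
    toy.select_dtypes(include="number")
       .corr()["isFraud"]
       .drop("isFraud")                 # the target's correlation with itself is always 1
       .sort_values(ascending=False)    # strongest positive correlation first
)
print(corr_with_target)
```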

#### **Changing value type (from categorical to numerical)**

In [52]:
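# Replace the 0/1 target with readable labels.
# Run this cell only once: after the first run 'isFraud' holds strings, so the
# comparison with 1 is False everywhere and a second run labels every row 'No Fraud'.
data['isFraud'] = np.where(data['isFraud'] == 1, 'Fraud', 'No Fraud')

# Encode each transaction type as an integer code
data.loc[data['type'] == 'PAYMENT', 'type'] = 1
data.loc[data['type'] == 'CASH_OUT', 'type'] = 2
data.loc[data['type'] == 'CASH_IN', 'type'] = 3
data.loc[data['type'] == 'TRANSFER', 'type'] = 4
data.loc[data['type'] == 'DEBIT', 'type'] = 5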
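
Assigning integers through `.loc` into a column that started out holding strings typically leaves that column with dtype `object`. A possible alternative, sketched here on a toy frame, is a single dictionary lookup with `Series.map`, which yields an integer column. The names `toy`, `type_codes`, and `type_code` are introduced for illustration only.

```python
import pandas as pd

# Toy frame containing the five transaction types from the dataset
toy = pd.DataFrame(
    {"type": ["PAYMENT", "CASH_OUT", "CASH_IN", "TRANSFER", "DEBIT", "PAYMENT"]}
)

# Same label-to-code assignment as in the cell above
type_codes = {"PAYMENT": 1, "CASH_OUT": 2, "CASH_IN": 3, "TRANSFER": 4, "DEBIT": 5}

# map() looks each label up in the dictionary; a label missing from it would become NaN
toy["type_code"] = toy["type"].map(type_codes)

print(toy["type_code"].tolist())
print(toy["type_code"].dtype)
```

Writing the codes to a new column keeps the original labels intact, so re-running the cell produces the same result.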


In [54]:
data.head()

| | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 9839.64 | C1231006815 | 170136.0 | 160296.36 | M1979787155 | 0.0 | 0.0 | No Fraud | 0 |
| 1 | 1 | 1 | 1864.28 | C1666544295 | 21249.0 | 19384.72 | M2044282225 | 0.0 | 0.0 | No Fraud | 0 |
| 2 | 1 | 4 | 181.0 | C1305486145 | 181.0 | 0.0 | C553264065 | 0.0 | 0.0 | No Fraud | 0 |
| 3 | 1 | 2 | 181.0 | C840083671 | 181.0 | 0.0 | C38997010 | 21182.0 | 0.0 | No Fraud | 0 |
| 4 | 1 | 1 | 11668.14 | C2048537720 | 41554.0 | 29885.86 | M1230701703 | 0.0 | 0.0 | No Fraud | 0 |


#### **Build Fraud Detection Model**
The aim of this section is to build a classification model capable of detecting fraud based on the given features. The steps are:
* Perform Feature Selection
* Split the data into train and test sets
* Train Model
* Model Prediction
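
The four steps above can be sketched end to end as follows. This is only an illustration under stated assumptions: the data is a small synthetic frame rather than the real dataset, the labelling rule is made up so that both classes appear, and `DecisionTreeClassifier` is an arbitrary model choice for this example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data (assumption: all values are random)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "type": rng.integers(1, 6, size=n),               # integer codes 1..5
    "amount": rng.uniform(1, 10_000, size=n),
    "oldbalanceOrg": rng.uniform(0, 50_000, size=n),
})
toy["newbalanceOrig"] = (toy["oldbalanceOrg"] - toy["amount"]).clip(lower=0)
# Made-up rule: call a transaction fraudulent when it empties the account
toy["isFraud"] = np.where(toy["newbalanceOrig"] == 0, "Fraud", "No Fraud")

# 1. Feature selection
features = ["type", "amount", "oldbalanceOrg", "newbalanceOrig"]
X = toy[features].to_numpy()
y = toy["isFraud"].to_numpy()   # 1-D target array

# 2. Split into train and test sets (hold out 20% of the rows)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Train the model
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# 4. Predict on the held-out rows and report accuracy
predictions = model.predict(X_test)
print("Test accuracy:", model.score(X_test, y_test))
```

Accuracy alone can be misleading when one class is rare, so on the real data it may be worth inspecting per-class results as well.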

In [58]:
from sklearn.model_selection import train_test_split
# The selected features are: 'type', 'amount', 'oldbalanceOrg', 'newbalanceOrig'
x = np.array(data[['type', 'amount', 'oldbalanceOrg', 'newbalanceOrig']])
y = np.array(data[['isFraud']])
print(x, "------", y)

[[1 9839.64 170136.0 160296.36]
 [1 1864.28 21249.0 19384.72]
 [4 181.0 181.0 0.0]
 ...
 [2 6311409.28 6311409.28 0.0]
 [4 850002.52 850002.52 0.0]
 [2 850002.52 850002.52 0.0]] ------ [['No Fraud']
 ['No Fraud']
 ['No Fraud']
 ...
 ['No Fraud']
 ['No Fraud']
 ['No Fraud']]
