# Fraud detection in bank transactions 2

The following analysis models the relationship between the variables in a sample of bank transactions and the transactions that turned out to be fraudulent, with the aim of preventing similar cases in the future. The dataset is available at https://www.kaggle.com/ntnu-testimon/paysim1.

### Loading libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import cross_val_score
import random

### Loading the dataset

In [2]:
df = pd.read_csv(r"./transactions.csv")
df.head()

| | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | PAYMENT | 9839.64 | C1231006815 | 170136.0 | 160296.36 | M1979787155 | 0.0 | 0.0 | 0 | 0 |
| 1 | 1 | PAYMENT | 1864.28 | C1666544295 | 21249.0 | 19384.72 | M2044282225 | 0.0 | 0.0 | 0 | 0 |
| 2 | 1 | TRANSFER | 181.0 | C1305486145 | 181.0 | 0.0 | C553264065 | 0.0 | 0.0 | 1 | 0 |
| 3 | 1 | CASH_OUT | 181.0 | C840083671 | 181.0 | 0.0 | C38997010 | 21182.0 | 0.0 | 1 | 0 |
| 4 | 1 | PAYMENT | 11668.14 | C2048537720 | 41554.0 | 29885.86 | M1230701703 | 0.0 | 0.0 | 0 | 0 |


In [3]:
df.shape

(6362620, 11)

**The data dictionary provided by the dataset's author describes the following variables:**

* step - maps a unit of time in the real world. In this case 1 step is 1 hour of time. Total steps 744 (30 days simulation).

* type - CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER.

* amount - amount of the transaction in local currency.

* nameOrig - customer who started the transaction

* oldbalanceOrg - initial balance before the transaction

* newbalanceOrig - new balance after the transaction

* nameDest - customer who is the recipient of the transaction

* oldbalanceDest - initial balance recipient before the transaction. Note that there is no information for customers that start with M (Merchants).

* newbalanceDest - new balance recipient after the transaction. Note that there is no information for customers that start with M (Merchants).

* isFraud - This is the transactions made by the fraudulent agents inside the simulation. In this specific dataset the fraudulent behavior of the agents aims to profit by taking control of customers' accounts and trying to empty the funds by transferring to another account and then cashing out of the system.

* isFlaggedFraud - The business model aims to control massive transfers from one account to another and flags illegal attempts. An illegal attempt in this dataset is an attempt to transfer more than 200.000 in a single transaction.

**The following conclusions are drawn from the above:**

* step: provides no relevant information, since it is only a time unit

* type: will be evaluated according to how frequently each activity occurs

* amount: kept unchanged

* nameOrig: provides no relevant information, since it is the name of the customer who starts the transaction

* oldbalanceOrg: kept unchanged

* newbalanceOrig: kept unchanged

* nameDest: provides no relevant information, since it is the name of the customer who receives the transaction, except for names starting with "M" (Merchants), so it will be converted to a categorical variable

* oldbalanceDest: has no data for customers whose name starts with "M" (Merchants), which makes it inconsistent for the analysis

* newbalanceDest: has no data for customers whose name starts with "M" (Merchants), which makes it inconsistent for the analysis

* isFraud: target variable

* isFlaggedFraud: frauds detected automatically by the system because the transaction exceeds 200.000. One model will be built that merges it into the target variable and another that ignores it.

### Data preprocessing

#### Dropping "step"

In [4]:
df0 = df

df0 = df0.drop("step", axis = 1)

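#### Filtering "type"

In [5]:
# numeric_only=True skips the string columns, which recent pandas versions refuse to average
df0.groupby("type").mean(numeric_only=True)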

| type | amount | oldbalanceOrg | newbalanceOrig | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|
| CASH_IN | 168920.242004 | 3590464.0 | 3759379.0 | 1587919.0 | 1467105.0 | 0.0 | 0.0 |
| CASH_OUT | 176273.964346 | 46023.8 | 17474.19 | 1497758.0 | 1691326.0 | 0.00184 | 0.0 |
| DEBIT | 5483.665314 | 68647.34 | 65161.65 | 1493136.0 | 1513003.0 | 0.0 | 0.0 |
| PAYMENT | 13057.60466 | 68216.83 | 61837.89 | 0.0 | 0.0 | 0.0 | 0.0 |
| TRANSFER | 910647.009645 | 54441.85 | 10288.16 | 2567606.0 | 3554567.0 | 0.007688 | 3e-05 |


**Only "CASH_OUT" and "TRANSFER" transactions (those in which money is withdrawn or transferred to another account) show any fraud cases, so all other types are filtered out.**

In [6]:
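# Keep only the two transaction types in which fraud occurs
df1 = df0[df0["type"].isin(["CASH_OUT", "TRANSFER"])]

# numeric_only=True skips the string columns, which recent pandas versions refuse to average
df1.groupby("type").mean(numeric_only=True)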

| type | amount | oldbalanceOrg | newbalanceOrig | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|
| CASH_OUT | 176273.964346 | 46023.804795 | 17474.192737 | 1497758.0 | 1691326.0 | 0.00184 | 0.0 |
| TRANSFER | 910647.009645 | 54441.851725 | 10288.156703 | 2567606.0 | 3554567.0 | 0.007688 | 3e-05 |
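
The list of types to keep can also be derived from the data rather than written by hand. A minimal sketch, assuming a small hypothetical frame `toy` that only shares the `type` and `isFraud` column names with the dataset (the rows are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data; only the column names match the dataset
toy = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT", "DEBIT", "TRANSFER", "CASH_IN"],
    "isFraud": [0, 1, 1, 0, 0, 0],
})

# Count fraud cases per transaction type
fraud_by_type = toy.groupby("type")["isFraud"].sum()

# Keep the types with at least one fraud case
fraud_types = sorted(fraud_by_type[fraud_by_type > 0].index)
filtered = toy[toy["type"].isin(fraud_types)]

print(fraud_types)
print(len(filtered))
```

Deriving the list this way would keep the filter correct if the notebook were rerun on a sample where other types also contain fraud.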


#### Dropping "nameOrig"

In [7]:
df2 = df1

df2 = df2.drop("nameOrig", axis = 1)

#### Converting "nameDest" to categorical

In [8]:
df2.nameDest.sort_values().unique()

array(['C1000004082', 'C1000004940', 'C1000013769', ..., 'C999993662',
       'C999996264', 'C999999956'], dtype=object)

**No recipient is of type "M" (Merchant), so the variable is dropped. "oldbalanceDest" and "newbalanceDest" are kept unchanged, since their data is now consistent.**
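
The sorted array above shows only its first and last entries, so a direct test on the prefix is a more reliable check. A minimal sketch, assuming hypothetical identifiers in the dataset's format:

```python
import pandas as pd

# Hypothetical recipient identifiers in the dataset's format
toy = pd.DataFrame({"nameDest": ["C553264065", "C38997010", "M1979787155", "C1000004082"]})

# True where the recipient is a merchant (identifier starts with "M")
is_merchant = toy["nameDest"].str.startswith("M")

print(is_merchant.sum())  # number of merchant recipients
print(is_merchant.any())  # whether any merchant is present
```

Applied to the filtered frame, `df2["nameDest"].str.startswith("M").any()` would be expected to return `False` if the conclusion above holds.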

In [9]:
df3 = df2

df3 = df3.drop("nameDest", axis = 1)

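#### Evaluating "isFlaggedFraud"

In [10]:
# numeric_only=True skips the string column "type", which recent pandas versions refuse to average
df3.groupby("isFraud").mean(numeric_only=True)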

| isFraud | amount | oldbalanceOrg | newbalanceOrig | oldbalanceDest | newbalanceDest | isFlaggedFraud |
|---|---|---|---|---|---|---|
| 0 | 314115.5 | 42879.69 | 15567.699347 | 1706998.0 | 2052024.0 | 0.0 |
| 1 | 1467967.0 | 1649668.0 | 192392.631836 | 544249.6 | 1279708.0 | 0.001948 |


**"isFlaggedFraud" is 0 for every non-fraudulent transaction and is set for only about 0.19% of the fraudulent ones, so it adds no new information about fraud beyond the target variable and is dropped.**
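
The same comparison can be read from a cross-tabulation of the two columns. A minimal sketch on hypothetical toy data (the counts are made up and do not reproduce the dataset):

```python
import pandas as pd

# Hypothetical toy data with the two columns being compared
toy = pd.DataFrame({
    "isFraud":        [0, 0, 0, 1, 1, 1, 1],
    "isFlaggedFraud": [0, 0, 0, 0, 0, 0, 1],
})

# Rows: true label, columns: system flag
table = pd.crosstab(toy["isFraud"], toy["isFlaggedFraud"])
print(table)

# Share of fraud cases that the flag catches
flag_recall = toy.loc[toy["isFraud"] == 1, "isFlaggedFraud"].mean()
print(flag_recall)
```

A flag that never fires on legitimate rows and fires on only a small share of frauds cannot change the target when merged into it.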

In [11]:
df4 = df3

df4 = df4.drop("isFlaggedFraud", axis = 1)

#### Converting "type" to categorical

In [12]:
df5 = df4

categoricas = ["type"]

for var in categoricas:
    # dtype=int keeps the 0/1 encoding shown in the output (pandas >= 2.0 defaults to bool)
    dummy = pd.get_dummies(df5[var], prefix=var, dtype=int)
    df5 = pd.concat([df5, dummy], axis = 1)

df5 = df5.drop(categoricas, axis = 1)
df5.columns.values

array(['amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest',
       'newbalanceDest', 'isFraud', 'type_CASH_OUT', 'type_TRANSFER'],
      dtype=object)
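
With only two transaction types left, the two dummy columns are exact complements (`type_CASH_OUT = 1 - type_TRANSFER`), so either one carries all the information. The sketch below, on hypothetical toy data, shows how `drop_first=True` in `pd.get_dummies` keeps a single column. It is an optional variant, not what this notebook does.

```python
import pandas as pd

# Hypothetical toy data with the two remaining transaction types
toy = pd.DataFrame({"type": ["TRANSFER", "CASH_OUT", "CASH_OUT", "TRANSFER"]})

# Encoding both categories: the two columns always sum to 1
both = pd.get_dummies(toy["type"], prefix="type", dtype=int)
print((both["type_CASH_OUT"] + both["type_TRANSFER"]).unique())

# Encoding with the first category dropped: one column is enough
single = pd.get_dummies(toy["type"], prefix="type", drop_first=True, dtype=int)
print(list(single.columns))
```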

In [13]:
df5.head()

| | amount | oldbalanceOrg | newbalanceOrig | oldbalanceDest | newbalanceDest | isFraud | type_CASH_OUT | type_TRANSFER |
|---|---|---|---|---|---|---|---|---|
| 2 | 181.0 | 181.0 | 0.0 | 0.0 | 0.0 | 1 | 0 | 1 |
| 3 | 181.0 | 181.0 | 0.0 | 21182.0 | 0.0 | 1 | 1 | 0 |
| 15 | 229133.94 | 15325.0 | 0.0 | 5083.0 | 51513.44 | 0 | 1 | 0 |
| 19 | 215310.3 | 705.0 | 0.0 | 22425.0 | 0.0 | 0 | 0 | 1 |
| 24 | 311685.89 | 10835.0 | 0.0 | 6267.0 | 2719172.89 | 0 | 0 | 1 |


In [14]:
df5.shape

(2770409, 8)

In [15]:
df5.isFraud.mean()*100

0.2964544224336551

In [16]:
df = df5

target = "isFraud"
# Compare with != rather than "not in", which would do a substring test against the string "isFraud"
predictoras = [x for x in df.columns.values if x != target]

**The dataset has 2770409 rows and 8 columns: 7 predictors and 1 target, with a fraud rate (`penetracion` in the code) of about 0.296%. The next step raises this rate by undersampling the non-fraud class.**

### Adjusting the fraud rate

In [17]:
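penetracion = 45  # desired share of fraud cases in the resampled data, in %

df_0 = df[df[target] == 0]
df_1 = df[df[target] == 1]

# Number of non-fraud rows needed so that fraud makes up `penetracion` % of the total
n = df_1.shape[0] * (100-penetracion)/penetracion

random.seed(2403)

# random.choices draws with replacement, so a non-fraud row may be selected more than once
index = random.choices(range(len(df_0)), k = int(n))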

df_0_n = df_0.iloc[index]

df_pen = pd.concat([df_1, df_0_n], axis = 0)

print(df_pen.shape[0])
print(df_pen[target].mean())

18251
0.45000273957591364
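
The size of the non-fraud sample follows from the target rate: with `n_1` fraud rows and a desired rate of `p` percent, `n_0 = n_1 * (100 - p) / p` non-fraud rows are needed. The sketch below wraps that calculation in a hypothetical helper, `undersample`, and uses `DataFrame.sample`, which draws without replacement by default. It is an alternative to the `random.choices` call above and would not reproduce the exact rows selected in this notebook.

```python
import pandas as pd

def undersample(data, target, rate_pct, seed=2403):
    """Return all positive rows plus enough negatives to reach rate_pct % positives."""
    pos = data[data[target] == 1]
    neg = data[data[target] == 0]
    n_neg = int(len(pos) * (100 - rate_pct) / rate_pct)
    # DataFrame.sample draws without replacement by default
    neg_sample = neg.sample(n=n_neg, random_state=seed)
    return pd.concat([pos, neg_sample], axis=0)

# Hypothetical toy data: 9 fraud rows among 1000
toy = pd.DataFrame({"isFraud": [1] * 9 + [0] * 991})
balanced = undersample(toy, "isFraud", rate_pct=45)

print(len(balanced))
print(balanced["isFraud"].mean())
```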


### Modeling

In [18]:
df = df_pen

x = df[predictoras]
y = df[target]

lm = linear_model.LogisticRegression(max_iter=2000)
lm.fit(x, y.values.ravel())
print(lm.score(x, y) * 100)
pd.DataFrame(list(zip(x.columns, np.transpose(lm.coef_))))

91.9620842693551


| | 0 | 1 |
|---|---|---|
| 0 | amount | [-3.422284492972445e-06] |
| 1 | oldbalanceOrg | [2.3778393223311844e-05] |
| 2 | newbalanceOrig | [-2.2277958342877366e-05] |
| 3 | oldbalanceDest | [1.2700322888691022e-05] |
| 4 | newbalanceDest | [-1.2906338330259779e-05] |
| 5 | type_CASH_OUT | [-7.357540963818567e-10] |
| 6 | type_TRANSFER | [1.530212821056125e-10] |
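
In a logistic regression each coefficient is the change in the log-odds of the positive class per unit of the predictor, so `exp(coef * delta)` is the factor by which the odds of fraud change when that predictor increases by `delta` and the others stay fixed. The coefficients look tiny because amounts and balances are in raw currency units. The sketch below rescales the five monetary coefficients, copied from the output above, to a step of 10,000 units; the step size is an arbitrary choice for illustration.

```python
import math

# Coefficients copied from the fitted model's output
coefs = {
    "amount": -3.422284492972445e-06,
    "oldbalanceOrg": 2.3778393223311844e-05,
    "newbalanceOrig": -2.2277958342877366e-05,
    "oldbalanceDest": 1.2700322888691022e-05,
    "newbalanceDest": -1.2906338330259779e-05,
}

delta = 10_000  # arbitrary step, in currency units

# Multiplicative change in the odds of fraud for a `delta` increase in each predictor
odds_ratios = {name: math.exp(b * delta) for name, b in coefs.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: {ratio:.3f}")
```

A ratio above 1 means higher values of the predictor are associated with higher odds of fraud in this model, and a ratio below 1 means the opposite.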


### Cross-validation, confusion matrix, sensitivity and specificity

In [19]:
scores = cross_val_score(linear_model.LogisticRegression(max_iter=2000), x, y.values.ravel(), scoring = "accuracy", cv = 10)
# The confusion matrix is computed on the same data the model was fitted on
df["pred"] = lm.predict(x)
confusion_matrix = pd.crosstab(df.pred, y)
# Rows are predictions, columns are true labels
TN = confusion_matrix.loc[0, 0]
FP = confusion_matrix.loc[1, 0]
FN = confusion_matrix.loc[0, 1]
TP = confusion_matrix.loc[1, 1]
sensibilidad = TP * 100 / (TP + FN)
especificidad = TN * 100 / (TN + FP)
print("Cross-validated accuracy is {v:.2f}%, sensitivity is {s:.2f}% and specificity is {e:.2f}%".format(v=scores.mean()*100, s=sensibilidad, e=especificidad))
confusion_matrix

Cross-validated accuracy is 92.08%, sensitivity is 96.31% and specificity is 88.40%


| pred \ isFraud | 0 | 1 |
|---|---|---|
| 0 | 8874 | 303 |
| 1 | 1164 | 7910 |
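
The reported rates can be rechecked by hand from the four counts in the confusion matrix. The sketch below repeats the arithmetic. The accuracy it prints is the one measured on the fitting data, which matches the 91.96% printed by `lm.score`, not the cross-validated figure.

```python
# Counts taken from the confusion matrix (rows: prediction, columns: true label)
TN, FN = 8874, 303
FP, TP = 1164, 7910

sensitivity = TP * 100 / (TP + FN)                  # share of frauds detected
specificity = TN * 100 / (TN + FP)                  # share of legitimate transactions accepted
accuracy = (TP + TN) * 100 / (TP + TN + FP + FN)    # share of all rows classified correctly

print(f"{sensitivity:.2f} {specificity:.2f} {accuracy:.2f}")  # 96.31 88.40 91.96
```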


### Conclusions

* With a fraud rate (penetration) of 45% in the resampled data, the model reaches a cross-validated accuracy above 92%, which breaks down into:
    * A sensitivity above 96%
    * A specificity above 88%

* Raising the fraud rate increases sensitivity but reduces specificity. It is up to the user to choose between a higher fraud detection rate at the cost of more legitimate transactions being rejected, or the reverse.
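
The same trade-off appears when the decision threshold of a classifier is moved, which is a different knob from the resampling rate used in this notebook. A minimal sketch on hypothetical scores and labels (made up for illustration, not model output):

```python
# Hypothetical predicted fraud probabilities and true labels
scores = [0.05, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.95]
labels = [0,    0,    0,    1,    0,    1,    1,    1]

def sens_spec(threshold):
    """Sensitivity and specificity when predicting fraud for score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    return tp / (tp + fn), tn / (tn + fp)

for t in (0.3, 0.5, 0.7):
    sens, spec = sens_spec(t)
    print(t, sens, spec)
```

On this toy data a lower threshold catches every fraud but rejects half of the legitimate rows, while a higher threshold does the opposite.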