# Fraud detection in bank transactions 1

The following analysis models the relationship between the variables in a sample of bank transactions and whether each transaction turned out to be fraudulent, with the aim of preventing similar cases in the future. The dataset is available at https://www.kaggle.com/shubhamjoshi2130of/abstract-data-set-for-credit-card-fraud-detection.

### Loading libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import cross_val_score
import random

### Loading the dataset

In [2]:
df = pd.read_csv(r"./creditcardcsvpresent.csv")

In [3]:
df.head()

Unnamed: 0,Merchant_id,Transaction date,Average Amount/transaction/day,Transaction_amount,Is declined,Total Number of declines/day,isForeignTransaction,isHighRiskCountry,Daily_chargeback_avg_amt,6_month_avg_chbk_amt,6-month_chbk_freq,isFradulent
0,3160040998,,100.0,3000.0,N,5,Y,Y,0,0.0,0,Y
1,3160040998,,100.0,4300.0,N,5,Y,Y,0,0.0,0,Y
2,3160041896,,185.5,4823.0,Y,5,N,N,0,0.0,0,Y
3,3160141996,,185.5,5008.5,Y,8,N,N,0,0.0,0,Y
4,3160241992,,500.0,26000.0,N,0,Y,Y,800,677.2,6,Y


In [4]:
df.shape

(3075, 12)

In [5]:
df.columns = ["user", "Tdate", "avgA/T/D", "Tamount", "declined", "declines/D", "foreignT", "highriskcountry", "avgA/C/D", "avgA/C/6M", "freq6MC", "fraud"]

### Exploratory analysis

In [6]:
df.groupby("fraud").count()

fraud,user,Tdate,avgA/T/D,Tamount,declined,declines/D,foreignT,highriskcountry,avgA/C/D,avgA/C/6M,freq6MC
N,2627,0,2627,2627,2627,2627,2627,2627,2627,2627,2627
Y,448,0,448,448,448,448,448,448,448,448,448


In [7]:
df.groupby("declined").count()

declined,user,Tdate,avgA/T/D,Tamount,declines/D,foreignT,highriskcountry,avgA/C/D,avgA/C/6M,freq6MC,fraud
N,3018,0,3018,3018,3018,3018,3018,3018,3018,3018,3018
Y,57,0,57,57,57,57,57,57,57,57,57


In [8]:
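# numeric_only=True avoids a TypeError on the Y/N text columns in recent pandas versions
df.groupby("fraud").mean(numeric_only=True)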

fraud,user,Tdate,avgA/T/D,Tamount,declines/D,avgA/C/D,avgA/C/6M,freq6MC
N,5009082000.0,,512.19369,7662.99589,0.475828,22.807766,15.824134,0.108108
Y,5129560000.0,,531.638032,22855.440551,3.78125,248.832589,181.917187,2.055804


In [9]:
df.declined.unique(), df.foreignT.unique(), df.highriskcountry.unique(), df.fraud.unique()

(array(['N', 'Y'], dtype=object),
 array(['Y', 'N'], dtype=object),
 array(['Y', 'N'], dtype=object),
 array(['Y', 'N'], dtype=object))

In [10]:
penetracion = (df.fraud=="Y").mean()*100
acierto_bruto = 100 - penetracion
print("The initial penetration is {p:.2f}% and the baseline accuracy is {a:.2f}%".format(p=penetracion, a=acierto_bruto))

The initial penetration is 14.57% and the baseline accuracy is 85.43%
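The baseline accuracy above is what a trivial classifier would reach by labelling every transaction as non-fraudulent. A minimal sketch on made-up labels; the `toy` Series and the `*_toy` names are assumptions for illustration, not data from the dataset:

```python
import pandas as pd

# Hypothetical labels: 3 fraudulent out of 20 transactions
toy = pd.Series(["Y"] * 3 + ["N"] * 17, name="fraud")

# Share of each class; the majority share is the accuracy of always predicting "N"
shares = toy.value_counts(normalize=True)
penetration_toy = shares["Y"] * 100
baseline_accuracy_toy = shares.max() * 100

print(f"penetration: {penetration_toy:.2f}%, baseline accuracy: {baseline_accuracy_toy:.2f}%")
```

Any model trained later has to beat this majority-class figure to be useful.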


**Conclusions from the exploratory analysis:**
* user: users with multiple transactions who have made at least one fraudulent or declined transaction will be flagged
* Tdate: will be dropped because it is empty and therefore carries no information about the target
* avgA/T/D: the mean differs only slightly between fraudulent and non-fraudulent transactions
* Tamount: the mean differs markedly between fraudulent and non-fraudulent transactions
* declined: users with multiple transactions who have had at least one declined transaction will be flagged
* declines/D: the mean differs markedly between fraudulent and non-fraudulent transactions
* foreignT: categorical variable
* highriskcountry: categorical variable
* avgA/C/D: the mean differs markedly between fraudulent and non-fraudulent transactions
* avgA/C/6M: the mean differs markedly between fraudulent and non-fraudulent transactions
* freq6MC: the mean differs markedly between fraudulent and non-fraudulent transactions
* fraud (target): users with multiple transactions who have made at least one fraudulent transaction will be flagged
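One rough way to put numbers on these differences is the ratio between the fraud and non-fraud means. The sketch below copies the group means from the `df.groupby("fraud").mean()` output. The ratio is only a heuristic for ranking variables, not a statistical test, and the names `means` and `ratio` are illustrative:

```python
import pandas as pd

# Group means copied from the df.groupby("fraud").mean() output
means = pd.DataFrame(
    {
        "avgA/T/D": [512.19369, 531.638032],
        "Tamount": [7662.99589, 22855.440551],
        "declines/D": [0.475828, 3.78125],
        "avgA/C/D": [22.807766, 248.832589],
        "avgA/C/6M": [15.824134, 181.917187],
        "freq6MC": [0.108108, 2.055804],
    },
    index=["N", "Y"],
)

# Ratio of the fraud mean to the non-fraud mean, per variable, largest first
ratio = (means.loc["Y"] / means.loc["N"]).sort_values(ascending=False)
print(ratio.round(2))
```

A ratio close to 1, as for `avgA/T/D`, suggests the variable separates the two classes poorly on its own.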

### Data preprocessing

**Variable removal**

In [11]:
df1 = df.drop(["Tdate"], axis=1)
df1.head()

Unnamed: 0,user,avgA/T/D,Tamount,declined,declines/D,foreignT,highriskcountry,avgA/C/D,avgA/C/6M,freq6MC,fraud
0,3160040998,100.0,3000.0,N,5,Y,Y,0,0.0,0,Y
1,3160040998,100.0,4300.0,N,5,Y,Y,0,0.0,0,Y
2,3160041896,185.5,4823.0,Y,5,N,N,0,0.0,0,Y
3,3160141996,185.5,5008.5,Y,8,N,N,0,0.0,0,Y
4,3160241992,500.0,26000.0,N,0,Y,Y,800,677.2,6,Y


**Prior-fraud marker**

In [12]:
df_fraud = df1[df1["fraud"] == "Y"]
user_fraud = df_fraud.user.unique()

In [13]:
# Work on a copy so that df1 is left unchanged
df2 = df1.copy()

# 1 if the user has at least one fraudulent transaction in the dataset, else 0
df2["userfraudmark"] = df2["user"].isin(user_fraud).astype(int)

In [14]:
df2.head()

Unnamed: 0,user,avgA/T/D,Tamount,declined,declines/D,foreignT,highriskcountry,avgA/C/D,avgA/C/6M,freq6MC,fraud,userfraudmark
0,3160040998,100.0,3000.0,N,5,Y,Y,0,0.0,0,Y,1
1,3160040998,100.0,4300.0,N,5,Y,Y,0,0.0,0,Y,1
2,3160041896,185.5,4823.0,Y,5,N,N,0,0.0,0,Y,1
3,3160141996,185.5,5008.5,Y,8,N,N,0,0.0,0,Y,1
4,3160241992,500.0,26000.0,N,0,Y,Y,800,677.2,6,Y,1


In [15]:
df_no_fraud = df2[df2["userfraudmark"] == 0]
df_no_fraud.groupby("fraud").mean(numeric_only=True)

fraud,user,avgA/T/D,Tamount,declines/D,avgA/C/D,avgA/C/6M,freq6MC,userfraudmark
N,5005300000.0,511.038849,7640.553301,0.476917,21.133918,14.521175,0.092331,0.0


In [16]:
df_no_mark = df2[df2["fraud"] == "N"]
df_no_mark.groupby("userfraudmark").mean(numeric_only=True)

userfraudmark,user,avgA/T/D,Tamount,declines/D,avgA/C/D,avgA/C/6M,freq6MC
0,5005300000.0,511.038849,7640.553301,0.476917,21.133918,14.521175,0.092331
1,6661274000.0,1016.666667,17466.666667,0.0,754.0,585.0,7.0


* There are non-fraudulent transactions made by users carrying the prior-fraud marker, so the marker is added as a predictor

**Prior-decline marker**
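The per-user marker can also be expressed as a grouped operation. The sketch below uses a made-up `toy` frame with the notebook's column names and is one possible formulation, not necessarily the one the author intended. Under this construction every fraudulent row is flagged by definition, because the marker is derived from the target itself:

```python
import pandas as pd

# Hypothetical mini-dataset; column names follow the notebook
toy = pd.DataFrame(
    {
        "user": [1, 1, 2, 3, 3],
        "fraud": ["Y", "N", "N", "N", "N"],
    }
)

# Flag every row whose user has at least one fraudulent transaction
is_fraud = toy["fraud"].eq("Y")
toy["userfraudmark"] = is_fraud.groupby(toy["user"]).transform("any").astype(int)

print(toy)
```

The second row illustrates the case described above: a non-fraudulent transaction from a user who is flagged because of another transaction.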

In [17]:
df_rechazo = df2[df2["declined"] == "Y"]
user_rechazo = df_rechazo.user.unique()

In [18]:
# Work on a copy so that df2 is left unchanged
df3 = df2.copy()

# 1 if the user has at least one declined transaction in the dataset, else 0
df3["userdeclinedmark"] = df3["user"].isin(user_rechazo).astype(int)

In [19]:
df3.head()

Unnamed: 0,user,avgA/T/D,Tamount,declined,declines/D,foreignT,highriskcountry,avgA/C/D,avgA/C/6M,freq6MC,fraud,userfraudmark,userdeclinedmark
0,3160040998,100.0,3000.0,N,5,Y,Y,0,0.0,0,Y,1,0
1,3160040998,100.0,4300.0,N,5,Y,Y,0,0.0,0,Y,1,0
2,3160041896,185.5,4823.0,Y,5,N,N,0,0.0,0,Y,1,1
3,3160141996,185.5,5008.5,Y,8,N,N,0,0.0,0,Y,1,1
4,3160241992,500.0,26000.0,N,0,Y,Y,800,677.2,6,Y,1,0


In [20]:
df3.groupby("fraud").mean(numeric_only=True)

fraud,user,avgA/T/D,Tamount,declines/D,avgA/C/D,avgA/C/6M,freq6MC,userfraudmark,userdeclinedmark
N,5009082000.0,512.19369,7662.99589,0.475828,22.807766,15.824134,0.108108,0.002284,0.003426
Y,5129560000.0,531.638032,22855.440551,3.78125,248.832589,181.917187,2.055804,1.0,0.127232


In [21]:
df3.userfraudmark.mean(), df3.userdeclinedmark.mean()

(0.14764227642276423, 0.021463414634146343)

* The prior-decline marker differs from the prior-fraud marker, so it is also added as a predictor

**Handling categorical variables**

In [22]:
df4 = df3
categoricas = ["declined", "foreignT", "highriskcountry"]

for var in categoricas:
    # dtype=int keeps the 0/1 encoding; recent pandas versions default to booleans
    dummy = pd.get_dummies(df4[var], prefix=var, dtype=int)
    df4 = pd.concat([df4, dummy], axis = 1)
    
df4 = df4.drop(categoricas, axis = 1)
df4 = df4.drop(["user"], axis = 1)
df4.columns.values

array(['avgA/T/D', 'Tamount', 'declines/D', 'avgA/C/D', 'avgA/C/6M',
       'freq6MC', 'fraud', 'userfraudmark', 'userdeclinedmark',
       'declined_N', 'declined_Y', 'foreignT_N', 'foreignT_Y',
       'highriskcountry_N', 'highriskcountry_Y'], dtype=object)
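Each binary variable above yields two complementary columns (`*_N` and `*_Y`), and one of them is redundant. A minimal sketch of an alternative encoding that keeps a single column per variable, using a made-up `toy` frame; `drop_first=True` is a choice made here for illustration, not what the notebook does:

```python
import pandas as pd

# Hypothetical rows; column names follow the notebook
toy = pd.DataFrame({"declined": ["N", "Y", "N"], "foreignT": ["Y", "N", "N"]})

# One call encodes all listed columns; drop_first=True drops the *_N level
encoded = pd.get_dummies(toy, columns=["declined", "foreignT"], drop_first=True, dtype=int)

print(encoded.columns.tolist())
```

Dropping one level avoids perfectly collinear predictors, which can make the coefficients of a linear model harder to interpret.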

**Recoding the target variable**

In [23]:
df5 = df4

df5["fraud"] = np.where(df5["fraud"] == "Y", 1, 0)

In [24]:
df5.head()

Unnamed: 0,avgA/T/D,Tamount,declines/D,avgA/C/D,avgA/C/6M,freq6MC,fraud,userfraudmark,userdeclinedmark,declined_N,declined_Y,foreignT_N,foreignT_Y,highriskcountry_N,highriskcountry_Y
0,100.0,3000.0,5,0,0.0,0,1,1,0,1,0,0,1,0,1
1,100.0,4300.0,5,0,0.0,0,1,1,0,1,0,0,1,0,1
2,185.5,4823.0,5,0,0.0,0,1,1,1,0,1,1,0,1,0
3,185.5,5008.5,8,0,0.0,0,1,1,1,0,1,1,0,1,0
4,500.0,26000.0,0,800,677.2,6,1,1,0,1,0,0,1,0,1


In [25]:
df5.shape

(3075, 15)

In [26]:
df5.fraud.mean()

0.1456910569105691

In [27]:
df = df5

target = "fraud"
# Compare by equality; "x not in target" would do a substring test against the string "fraud"
predictoras = [x for x in df.columns.values if x != target]

**The dataset has 3075 rows and 15 columns, of which 14 are predictors and 1 is the target, with a penetration of 14.57%. The next step raises the penetration.**

### Penetration adjustment

In [28]:
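# Target share of fraudulent rows, in percent
penetracion = 40

df_0 = df[df[target] == 0]
df_1 = df[df[target] == 1]

# Number of non-fraudulent rows to keep so that fraud makes up `penetracion` percent
n = df_1.shape[0] * (100-penetracion)/penetracion
# Share of the non-fraudulent rows that is kept, in percent (informational only)
porc = n * 100 / df_0.shape[0]

random.seed(2403)

# Note: random.choices draws with replacement, so some rows may be repeated
index = random.choices(range(len(df_0)), k = int(n))

df_0_n = df_0.iloc[index]

df_pen = pd.concat([df_1, df_0_n], axis = 0)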

print(df_pen.shape[0])
print(df_pen[target].mean())

1120
0.4
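With all fraudulent rows kept, the number of non-fraudulent rows needed for a target penetration of p percent is n = n_fraud * (100 - p) / p, which for 448 fraudulent rows and p = 40 gives 672. Because `random.choices` samples with replacement, the cell above may select the same row more than once. The sketch below shows a variant that draws without replacement using `DataFrame.sample`; the helper `undersample` and the `toy` data are hypothetical:

```python
import pandas as pd

def undersample(df, target, penetration, seed=2403):
    """Keep all positives and draw negatives without replacement so that
    positives make up `penetration` percent of the result."""
    pos = df[df[target] == 1]
    neg = df[df[target] == 0]
    n_neg = int(len(pos) * (100 - penetration) / penetration)
    if n_neg > len(neg):
        raise ValueError("not enough negatives for the requested penetration")
    return pd.concat([pos, neg.sample(n=n_neg, random_state=seed)], axis=0)

# Hypothetical data: 10 positives and 90 negatives
toy = pd.DataFrame({"fraud": [1] * 10 + [0] * 90, "x": range(100)})
balanced = undersample(toy, "fraud", penetration=40)
print(balanced.shape[0], balanced["fraud"].mean())
```

Sampling without replacement keeps the index unique, which also avoids duplicated rows ending up in different cross-validation folds.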


### Modeling

In [29]:
df = df_pen

x = df[predictoras]
y = df[target]

lm = linear_model.LogisticRegression(max_iter=2000)
lm.fit(x, y.values.ravel())
print(lm.score(x, y) * 100)
pd.DataFrame(list(zip(x.columns, np.transpose(lm.coef_))))

100.0


Unnamed: 0,0,1
0,avgA/T/D,[-0.020016283057260272]
1,Tamount,[0.0007258780635909528]
2,declines/D,[0.4604019011150341]
3,avgA/C/D,[0.0043670865749278484]
4,avgA/C/6M,[-0.004197303060503699]
5,freq6MC,[0.5144455600487143]
6,userfraudmark,[4.054866947296831]
7,userdeclinedmark,[0.19643495757351329]
8,declined_N,[-0.8699080052126658]
9,declined_Y,[0.19390972653194868]
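Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are often easier to read. The sketch below copies three coefficients from the table above; the names `coefs` and `odds_ratios` are illustrative. Since the predictors were not standardized, the magnitudes are not directly comparable across variables measured in different units:

```python
import numpy as np
import pandas as pd

# Coefficients copied from the table above
coefs = pd.Series(
    {
        "declines/D": 0.4604019011150341,
        "freq6MC": 0.5144455600487143,
        "userfraudmark": 4.054866947296831,
    }
)

# exp(coefficient) is the multiplicative change in the odds of fraud
# for a one-unit increase in the variable, holding the others fixed
odds_ratios = np.exp(coefs)
print(odds_ratios.round(2))
```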


### Cross-validation, confusion matrix, sensitivity and specificity

In [30]:
# Accuracy is estimated with 10-fold cross-validation
scores = cross_val_score(linear_model.LogisticRegression(max_iter=2000), x, y.values.ravel(), scoring = "accuracy", cv = 10)
# The confusion matrix uses predictions on the same rows the model was fitted on
df["pred"] = lm.predict(x)
confusion_matrix = pd.crosstab(df.pred, y)
# Rows are predicted labels, columns are actual labels
TN = confusion_matrix.loc[0, 0]
FP = confusion_matrix.loc[1, 0]
FN = confusion_matrix.loc[0, 1]
TP = confusion_matrix.loc[1, 1]
sensibilidad = TP * 100 / (TP + FN)
especificidad = TN * 100 / (TN + FP)
print("The accuracy after cross-validation is {v:.2f}%, the sensitivity is {s:.2f}% and the specificity is {e:.2f}%".format(v=scores.mean()*100, s=sensibilidad, e=especificidad))
confusion_matrix

The accuracy after cross-validation is 99.73%, the sensitivity is 100.00% and the specificity is 100.00%


pred \ fraud,0,1
0,672,0
1,0,448
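In the cell above only the accuracy comes from cross-validation; sensitivity and specificity are computed from predictions on the rows used for fitting. A minimal sketch of obtaining both from out-of-fold predictions with `cross_val_predict`, on synthetic data that stands in for the notebook's predictors and target (`X_toy`, `y_toy` and the resulting figures are illustrative only):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for the predictors and target (not the notebook's data)
rng = np.random.default_rng(2403)
X_toy = rng.normal(size=(300, 3))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=300) > 0.8).astype(int)

# Each prediction comes from a model that did not see that row during fitting
pred = cross_val_predict(LogisticRegression(max_iter=2000), X_toy, y_toy, cv=10)

# sklearn convention: rows are true labels, columns are predicted labels
tn, fp, fn, tp = confusion_matrix(y_toy, pred, labels=[0, 1]).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
```

Note that scikit-learn's `confusion_matrix` places true labels on the rows, the opposite orientation of the `pd.crosstab(df.pred, y)` table above.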
