<font color="#D31525"><h3 align="left">Fraud detection in financial transactions</h3></font>
<font color="#2C3E50"><h3 align="left">DATASET GENERATION USING THE OVERSAMPLING CLASS-BALANCING TECHNIQUE</h3></font>
<font color="#2C3E50"><h3 align="left">CLASS BALANCING WITH SMOTE TO GENERATE MINORITY-CLASS RECORDS</h3></font>
<font color="#2C3E50"><h3 align="left">GENERATION OF THE TRAIN-TEST AND EVALUATION FILES USING PCA</h3></font>

## Import libraries
This first part of the code imports the libraries used in the notebook:

In [1]:
# Data manipulation packages
import pandas as pd
import numpy as np
import boto3

# Packages for oversampling (SMOTE)
from collections import Counter
from imblearn.over_sampling import SMOTE

# Visualization packages
import matplotlib.pyplot as plt
import seaborn as sns

# Date handling package
import datetime as dt

# Package for PCA
from sklearn.decomposition import PCA


## Import the dataset
Once the dataset has been cleaned, we carry out the descriptive study and data discovery. To do so, we import the cleaned dataset:

In [2]:
s3 = boto3.client("s3")

# Select the bucket we are going to work with
BUCKET_NAME = 'tfmfraud'

In [3]:
# Download the file from the S3 bucket to the EC2 machine so we can work with it.
s3.download_file(Bucket = BUCKET_NAME, Key = 'df_new_var.csv',Filename = '/tmp/df_new_var.csv')

In [4]:
# Read the file and load it into a dataframe.
df = pd.read_csv('/tmp/df_new_var.csv', dtype={'rank':'category'})

In [5]:
df.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,day,week,isFlaggedFraud_New,ind_merchant,balanceOrig,balanceDest,hours_day,amount_category
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0.0,0.0,1,1,0,1,9839.64,0.0,1,"(-0.001, 200000.0]"
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0.0,0.0,1,1,0,1,1864.28,0.0,1,"(-0.001, 200000.0]"
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1.0,0.0,1,1,0,0,181.0,0.0,1,"(-0.001, 200000.0]"
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1.0,0.0,1,1,0,0,181.0,-21182.0,1,"(-0.001, 200000.0]"
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0.0,0.0,1,1,0,1,11668.14,0.0,1,"(-0.001, 200000.0]"


In [6]:
df.shape

(6362620, 19)

In [7]:
# Delete the file from the EC2 machine's /tmp path to free disk space.
!rm /tmp/df_new_var.csv

We clean the dataset as a first step toward balancing the sample, applying the same treatment as in the UNDERSAMPLING case:

1.-  We remove records that meet all of the following:
  *  Both the sender and the recipient are customers, not merchants
  *  The value of the variable amount is greater than 0
  *  Both the origin and destination balances are 0
  *  They are not flagged as fraudulent transactions in our dataset.
  
2.- We remove records that meet all of the following:
  *  The recipient is a merchant
  *  The value of the variable amount is greater than 0 
  *  The origin balance is 0 (in this dataset the balances of recipient merchants are always 0)
  *  They are not flagged as fraudulent transactions in our dataset.
  
3.- We remove records whose type is anything other than CASH_OUT or TRANSFER
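
The three rules above can be sketched on a toy dataframe as boolean masks. The rows below are invented for illustration, and the helper name `clean_transactions` is an assumption rather than part of the notebook; only the column names come from the dataset.

```python
import pandas as pd

def clean_transactions(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the three cleaning rules in one pass (hypothetical helper)."""
    not_fraud = df["isFraud"] == 0
    has_amount = df["amount"] > 0
    orig_zero = (df["oldbalanceOrg"] == 0) & (df["newbalanceOrig"] == 0)
    dest_zero = (df["oldbalanceDest"] == 0) & (df["newbalanceDest"] == 0)

    # Rule 1: customer-to-customer, every balance is zero, not fraud.
    rule1 = (df["ind_merchant"] == 0) & orig_zero & dest_zero & has_amount & not_fraud
    # Rule 2: merchant recipient, origin balances are zero, not fraud.
    rule2 = (df["ind_merchant"] == 1) & orig_zero & has_amount & not_fraud
    # Rule 3: keep only CASH_OUT and TRANSFER transactions.
    keep_type = df["type"].isin(["CASH_OUT", "TRANSFER"])

    return df[~rule1 & ~rule2 & keep_type]

# Invented rows, one per case of interest.
toy = pd.DataFrame({
    "type": ["TRANSFER", "PAYMENT", "CASH_OUT", "DEBIT", "TRANSFER"],
    "amount": [100.0, 50.0, 181.0, 20.0, 75.0],
    "oldbalanceOrg": [0.0, 0.0, 181.0, 500.0, 0.0],
    "newbalanceOrig": [0.0, 0.0, 0.0, 480.0, 0.0],
    "oldbalanceDest": [0.0, 0.0, 21182.0, 10.0, 0.0],
    "newbalanceDest": [0.0, 0.0, 0.0, 30.0, 0.0],
    "ind_merchant": [0, 1, 0, 0, 0],
    "isFraud": [0, 0, 1, 0, 1],
})
cleaned = clean_transactions(toy)
print(cleaned.index.tolist())
```

Fraudulent rows are never matched by rules 1 and 2, so they survive even when all their balances are zero.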


In [8]:
df1 = df.drop(df[(df['oldbalanceOrg'] == 0) &
               (df['newbalanceOrig'] == 0) &
               (df['oldbalanceDest'] == 0) &
               (df['newbalanceDest'] == 0) &
               (df["ind_merchant"]== 0) &
               (df['amount'] > 0) & 
               (df['isFraud'] == 0)].index)

In [9]:
df1.shape

(6362596, 19)

In [10]:
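# Note: the mask is built from df1, but the rows are dropped from df, so the rows removed in step 1 are still present in df2.
df2 = df.drop(df1[(df1['oldbalanceOrg'] == 0) &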
               (df1['newbalanceOrig'] == 0) &
               (df1["ind_merchant"]== 1) &
               (df1['amount'] > 0) & 
               (df1['isFraud'] == 0)].index)

In [11]:
df2.shape

(5588375, 19)

In [12]:
df3 = df2[df2.type.isin(["CASH_OUT", "TRANSFER"])]

In [13]:
df3.shape

(2770409, 19)

In [14]:
df3.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,day,week,isFlaggedFraud_New,ind_merchant,balanceOrig,balanceDest,hours_day,amount_category
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1.0,0.0,1,1,0,0,181.0,0.0,1,"(-0.001, 200000.0]"
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1.0,0.0,1,1,0,0,181.0,-21182.0,1,"(-0.001, 200000.0]"
15,1,CASH_OUT,229133.94,C905080434,15325.0,0.0,C476402209,5083.0,51513.44,0.0,0.0,1,1,1,0,15325.0,46430.44,1,"(200000.0, 400000.0]"
19,1,TRANSFER,215310.3,C1670993182,705.0,0.0,C1100439041,22425.0,0.0,0.0,0.0,1,1,1,0,705.0,-22425.0,1,"(200000.0, 400000.0]"
24,1,TRANSFER,311685.89,C1984094095,10835.0,0.0,C932583850,6267.0,2719172.89,0.0,0.0,1,1,1,0,10835.0,2712905.89,1,"(200000.0, 400000.0]"


We delete the dataframes df and df2 to free memory.

In [15]:
del df
del df2

Before oversampling, we discard records so as not to work with very large datasets. The idea is to keep all the fraudulent transactions and 500,000 of the non-fraudulent ones.

In [17]:
fraude = df3.loc[df3['isFraud'] == 1]
noFraude = df3.loc[df3['isFraud'] == 0]
muestraNoFr = noFraude.sample(500000)

In [18]:
print(fraude.shape)
print(noFraude.shape)
print(muestraNoFr.shape)

(8213, 19)
(2762196, 19)
(500000, 19)


In [20]:
tgSmote = pd.concat([fraude,muestraNoFr])

In [21]:
print(tgSmote.shape)

(508213, 19)


In [22]:
pd.value_counts(tgSmote['isFraud'])

0.0    500000
1.0      8213
Name: isFraud, dtype: int64
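
`sample` draws a different random subset on every run unless a seed is fixed. A minimal sketch of the same reduction on toy data, assuming a hypothetical helper `downsample_majority` and a seed of 42 (neither appears in the notebook):

```python
import pandas as pd

def downsample_majority(df, target, n_majority, seed=42):
    """Keep every minority row (target == 1) and a seeded random sample
    of n_majority majority rows (target == 0). Hypothetical helper."""
    minority = df.loc[df[target] == 1]
    majority = df.loc[df[target] == 0].sample(n_majority, random_state=seed)
    return pd.concat([minority, majority])

# Toy data: 5 fraud rows and 95 non-fraud rows.
toy = pd.DataFrame({"amount": range(100), "isFraud": [1] * 5 + [0] * 95})
reduced = downsample_majority(toy, "isFraud", n_majority=20)
print(reduced["isFraud"].value_counts().to_dict())
```

Fixing `random_state` means that rerunning the cell yields the same 500,000 non-fraud rows, which makes the downstream files reproducible.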

In [61]:
del df3

We apply the oversampling technique: SMOTE
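
At its core, SMOTE creates a synthetic minority record by taking a minority sample, choosing one of its nearest minority neighbors, and picking a random point on the segment between them. The sketch below illustrates only that interpolation step with NumPy; the function name `smote_point` and the toy values are assumptions, and imbalanced-learn's implementation handles the neighbor search and sampling in its own way.

```python
import numpy as np

def smote_point(x, neighbor, rng):
    """Return a synthetic point on the segment between x and neighbor."""
    lam = rng.uniform(0.0, 1.0)   # random interpolation weight in [0, 1)
    return x + lam * (neighbor - x)

rng = np.random.default_rng(0)
x = np.array([181.0, 0.0])          # toy minority sample
neighbor = np.array([2806.0, 0.0])  # toy nearest minority neighbor
synthetic = smote_point(x, neighbor, rng)
print(synthetic)
```

Because every feature is interpolated, the generated records can contain non-integer values in columns that were originally integers or indicators.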

In [48]:
X = tgSmote.drop(['isFraud', 'nameOrig', 'nameDest', 'amount_category'], axis=1)
y = tgSmote['isFraud']

In [49]:
print(X.shape)
print(y.shape)

(508213, 15)
(508213,)


In [50]:
y.head()

2      1.0
3      1.0
251    1.0
252    1.0
680    1.0
Name: isFraud, dtype: float64

In [51]:
X.head()

Unnamed: 0,step,type,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFlaggedFraud,day,week,isFlaggedFraud_New,ind_merchant,balanceOrig,balanceDest,hours_day
2,1,TRANSFER,181.0,181.0,0.0,0.0,0.0,0.0,1,1,0,0,181.0,0.0,1
3,1,CASH_OUT,181.0,181.0,0.0,21182.0,0.0,0.0,1,1,0,0,181.0,-21182.0,1
251,1,TRANSFER,2806.0,2806.0,0.0,0.0,0.0,0.0,1,1,0,0,2806.0,0.0,1
252,1,CASH_OUT,2806.0,2806.0,0.0,26202.0,0.0,0.0,1,1,0,0,2806.0,-26202.0,1
680,1,TRANSFER,20128.0,20128.0,0.0,0.0,0.0,0.0,1,1,0,0,20128.0,0.0,1


In [52]:
counter = Counter(y)
print(counter)

Counter({0.0: 500000, 1.0: 8213})


We replace the categorical variables with dummy variables.

In [53]:
XDummies = pd.get_dummies(X, drop_first=True)

In [54]:
print(XDummies.shape)

(508213, 15)
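
With `drop_first=True`, a column with k categories becomes k - 1 indicator columns. Here `type` has only two values left (CASH_OUT and TRANSFER), so it collapses into a single `type_TRANSFER` column and the column count stays at 15. A toy illustration follows; `dtype=int` is passed explicitly on the assumption that the reader's pandas version may otherwise return booleans instead of 0/1.

```python
import pandas as pd

toy = pd.DataFrame({
    "amount": [181.0, 181.0, 2806.0],
    "type": ["TRANSFER", "CASH_OUT", "TRANSFER"],
})

# Two categories -> one indicator column; the first category (CASH_OUT) is dropped.
dummies = pd.get_dummies(toy, drop_first=True, dtype=int)
print(dummies.columns.tolist())
print(dummies["type_TRANSFER"].tolist())
```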


In [55]:
XDummies.head()

Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFlaggedFraud,day,week,isFlaggedFraud_New,ind_merchant,balanceOrig,balanceDest,hours_day,type_TRANSFER
2,1,181.0,181.0,0.0,0.0,0.0,0.0,1,1,0,0,181.0,0.0,1,1
3,1,181.0,181.0,0.0,21182.0,0.0,0.0,1,1,0,0,181.0,-21182.0,1,0
251,1,2806.0,2806.0,0.0,0.0,0.0,0.0,1,1,0,0,2806.0,0.0,1,1
252,1,2806.0,2806.0,0.0,26202.0,0.0,0.0,1,1,0,0,2806.0,-26202.0,1,0
680,1,20128.0,20128.0,0.0,0.0,0.0,0.0,1,1,0,0,20128.0,0.0,1,1


In [56]:
over = SMOTE(sampling_strategy=0.1)

In [57]:
X_sm, y_sm = over.fit_resample(XDummies, y)

In [58]:
print(XDummies.shape)
print(X_sm.shape)
print(y.shape)
print(y_sm.shape)

(508213, 15)
(550000, 15)
(508213,)
(550000,)


In [59]:
y_sm.head()

0    1.0
1    1.0
2    1.0
3    1.0
4    1.0
Name: isFraud, dtype: float64

In [60]:
counter_sm = Counter(y_sm)
print(counter_sm)

Counter({0.0: 500000, 1.0: 50000})
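
When `sampling_strategy` is a float, imbalanced-learn interprets it as the desired ratio of minority to majority samples after resampling. The arithmetic behind the counts above can be checked as follows (the variable names are illustrative):

```python
n_majority = 500_000
n_minority = 8_213
sampling_strategy = 0.1

# Target minority size after resampling: ratio * majority size.
n_minority_target = round(sampling_strategy * n_majority)
# Number of synthetic fraud records SMOTE has to generate.
n_synthetic = n_minority_target - n_minority

print(n_minority_target)               # 50000
print(n_synthetic)                     # 41787
print(n_majority + n_minority_target)  # 550000
```

In other words, roughly five out of every six fraud records in the resampled data are synthetic.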


We join X_sm and y_sm into a single dataframe.

In [62]:
tgSmote2 = pd.merge(X_sm, y_sm, right_index=True, left_index=True)

In [64]:
tgSmote2.shape

(550000, 16)

In [68]:
pd.value_counts(tgSmote2['isFraud'])

0.0    500000
1.0     50000
Name: isFraud, dtype: int64

In [67]:
tgSmote2.head()

Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFlaggedFraud,day,week,isFlaggedFraud_New,ind_merchant,balanceOrig,balanceDest,hours_day,type_TRANSFER,isFraud
0,1,181.0,181.0,0.0,0.0,0.0,0.0,1,1,0,0,181.0,0.0,1,1,1.0
1,1,181.0,181.0,0.0,21182.0,0.0,0.0,1,1,0,0,181.0,-21182.0,1,0,1.0
2,1,2806.0,2806.0,0.0,0.0,0.0,0.0,1,1,0,0,2806.0,0.0,1,1,1.0
3,1,2806.0,2806.0,0.0,26202.0,0.0,0.0,1,1,0,0,2806.0,-26202.0,1,0,1.0
4,1,20128.0,20128.0,0.0,0.0,0.0,0.0,1,1,0,0,20128.0,0.0,1,1,1.0


In [71]:
fraude = tgSmote2.loc[tgSmote2['isFraud'] == 1]
noFraude = tgSmote2.loc[tgSmote2['isFraud'] == 0]

## **FRAUD**

In [72]:
evalFr = fraude.sample(800)
trainFr = fraude[~fraude.index.isin(evalFr.index)]

In [73]:
print(evalFr.shape)
print(trainFr.shape)

(800, 16)
(49200, 16)


## **NON-FRAUD**

In [74]:
evalNoFr = noFraude.sample(79200)
trainNoFr = noFraude[~noFraude.index.isin(evalNoFr.index)]

In [75]:
print(evalNoFr.shape)
print(trainNoFr.shape)

(79200, 16)
(420800, 16)
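
Both splits follow the same pattern: sample a fixed number of rows for evaluation and keep the remaining rows for train-test by excluding the sampled index. A minimal sketch, assuming a hypothetical helper `holdout_split` and a fixed seed (the notebook does not set one):

```python
import pandas as pd

def holdout_split(df, n_eval, seed=0):
    """Return (eval_df, train_df): n_eval random rows and the remainder."""
    eval_df = df.sample(n_eval, random_state=seed)
    # Everything whose index label was not sampled goes to train-test.
    train_df = df[~df.index.isin(eval_df.index)]
    return eval_df, train_df

toy = pd.DataFrame({"amount": range(50), "isFraud": [1.0] * 50})
eval_toy, train_toy = holdout_split(toy, n_eval=8)
print(eval_toy.shape, train_toy.shape)
```

This pattern relies on the index labels being unique; with duplicated labels, `isin` would remove more rows than were sampled.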


We concatenate the evaluation and train-test dataframes.

In [76]:
evaluacion = pd.concat([evalFr,evalNoFr])
trainTest =  pd.concat([trainFr,trainNoFr])

In [77]:
print(evaluacion.shape)
print(trainTest.shape)

(80000, 16)
(470000, 16)


In [79]:
print(pd.value_counts(trainTest['isFraud']))
print(pd.value_counts(evaluacion['isFraud']))

0.0    420800
1.0     49200
Name: isFraud, dtype: int64
0.0    79200
1.0      800
Name: isFraud, dtype: int64


We save the files produced after generating the minority-class records so that they can be reused whenever needed.

In [80]:
trainTest.to_csv('/tmp/train_test_over.csv', index = False)
evaluacion.to_csv('/tmp/evaluacion_over.csv', index = False)

In [81]:
s3.upload_file(Bucket = BUCKET_NAME, Key = 'train_test_over.csv', Filename = '/tmp/train_test_over.csv')
s3.upload_file(Bucket = BUCKET_NAME, Key = 'evaluacion_over.csv', Filename = '/tmp/evaluacion_over.csv')
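
The save, upload, and delete sequence recurs throughout the notebook and could be wrapped in a small helper. The sketch below is only an illustration: `checkpoint_to_s3` is a hypothetical name, and the client is passed in as a parameter so that any object exposing `upload_file(Filename=..., Bucket=..., Key=...)`, such as the `boto3` client created earlier, can be used. A stand-in client exercises the helper without AWS access.

```python
import os
import tempfile

def checkpoint_to_s3(write_fn, client, bucket, key):
    """Write a file via write_fn(path), upload it, then delete the local copy."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    os.close(fd)
    try:
        write_fn(path)  # e.g. lambda p: df.to_csv(p, index=False)
        client.upload_file(Filename=path, Bucket=bucket, Key=key)
    finally:
        os.remove(path)  # always free local disk space
    return path

class FakeClient:
    """Stand-in used only to exercise the helper without AWS access."""
    def __init__(self):
        self.uploads = []

    def upload_file(self, Filename, Bucket, Key):
        with open(Filename) as fh:
            self.uploads.append((Bucket, Key, fh.read()))

def write_demo(path):
    with open(path, "w") as fh:
        fh.write("PC0,isFraud\n1.0,0\n")

fake = FakeClient()
local_path = checkpoint_to_s3(write_demo, fake, "tfmfraud", "demo.csv")
print(fake.uploads[0][:2], os.path.exists(local_path))
```

With the real client, a call might look like `checkpoint_to_s3(lambda p: trainTest.to_csv(p, index=False), s3, BUCKET_NAME, 'train_test_over.csv')`.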

In [37]:
# Delete the files from the EC2 machine's /tmp path to free disk space.
!rm /tmp/train_test_over.csv
!rm /tmp/evaluacion_over.csv

We read the train_test_over file to apply PCA. This gives us a checkpoint so that the previous work does not have to be redone.

In [82]:
# Download the file from the S3 bucket to the EC2 machine so we can work with it.
s3.download_file(Bucket = BUCKET_NAME, Key = 'train_test_over.csv',Filename = '/tmp/train_test_over.csv')

In [83]:
# Read the file and load it into a dataframe.
trainTest = pd.read_csv('/tmp/train_test_over.csv', dtype={'rank':'category'})

In [84]:
trainTest.shape

(470000, 16)

In [85]:
trainTest.head()

Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFlaggedFraud,day,week,isFlaggedFraud_New,ind_merchant,balanceOrig,balanceDest,hours_day,type_TRANSFER,isFraud
0,1,181.0,181.0,0.0,0.0,0.0,0.0,1,1,0,0,181.0,0.0,1,1,1.0
1,1,181.0,181.0,0.0,21182.0,0.0,0.0,1,1,0,0,181.0,-21182.0,1,0,1.0
2,1,2806.0,2806.0,0.0,0.0,0.0,0.0,1,1,0,0,2806.0,0.0,1,1,1.0
3,1,2806.0,2806.0,0.0,26202.0,0.0,0.0,1,1,0,0,2806.0,-26202.0,1,0,1.0
4,1,20128.0,20128.0,0.0,0.0,0.0,0.0,1,1,0,0,20128.0,0.0,1,1,1.0


In [86]:
print(pd.value_counts(trainTest['isFraud']))

0.0    420800
1.0     49200
Name: isFraud, dtype: int64


In [87]:
# Delete the file from the EC2 machine's /tmp path to free disk space.
!rm /tmp/train_test_over.csv

We prepare the *trainTest* dataframe for PCA.
We drop the isFraud column and keep it separately as the label; nameOrig and nameDest were already removed before applying SMOTE.

In [128]:
mydata2 = trainTest.drop(['isFraud'], axis=1)
fraud = pd.DataFrame(trainTest['isFraud'])

In [129]:
mydata2.shape

(470000, 15)

In [130]:
fraud.head()

Unnamed: 0,isFraud
0,1.0
1,1.0
2,1.0
3,1.0
4,1.0


In [95]:
fraud.shape

(470000, 1)

* **PCA**

In [131]:
pca = PCA()
X_pca = pca.fit_transform(mydata2.values)

In [132]:
pca.explained_variance_ratio_.cumsum()

array([0.86319793, 0.95038469, 0.98866954, 0.99663263, 1.        ,
       1.        , 1.        , 1.        , 1.        , 1.        ,
       1.        , 1.        , 1.        , 1.        , 1.        ])

Since 4 components explain 99.7% of the variance, we apply a PCA with 4 principal components. The reason for applying Principal Component Analysis is to be able to compare explainable models against non-explainable ones.
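
The choice of 4 components can be derived from the cumulative explained variance rather than read off by eye. A small sketch using the first five values printed above and an assumed threshold of 99% (the notebook does not state an explicit threshold):

```python
import numpy as np

cum_var = np.array([0.86319793, 0.95038469, 0.98866954, 0.99663263, 1.0])
threshold = 0.99  # assumed cut-off, not taken from the notebook

# Position of the first cumulative value >= threshold, plus one to get a count.
n_components = int(np.searchsorted(cum_var, threshold) + 1)
print(n_components)  # 4
```

scikit-learn's `PCA` also accepts a fraction between 0 and 1 as `n_components`, in which case it selects the number of components needed to reach that share of the variance.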

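We delete the array X_pca to free memory. The dataframe mydata2 is kept because it is still needed to fit the 4-component PCA.

In [133]:
del X_pca

In [134]:
pca = PCA(4)
dfX_pca4 = pd.DataFrame(pca.fit_transform(mydata2.values))
# mydata2 is no longer needed once the projection has been computed
del mydata2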

In [135]:
pca.explained_variance_ratio_.cumsum()

array([0.86319793, 0.95038469, 0.98866954, 0.99663263])

In [136]:
dfX_pca4.rename(columns={0: 'PC0'
                    , 1: 'PC1'
                    , 2: 'PC2'
                    , 3: 'PC3'}, inplace=True)
dfX_pca4.head()

Unnamed: 0,PC0,PC1,PC2,PC3
0,-2587585.0,-438516.040597,16731.941171,42054.210743
1,-2575372.0,-454458.727901,-2238.48605,34600.500662
2,-2587410.0,-434868.610206,14605.315321,40424.161792
3,-2572303.0,-454589.613548,-8860.983138,31203.969577
4,-2586258.0,-410799.73813,572.015144,29667.701638


In [137]:
print(dfX_pca4.shape)
print(fraud.shape)

(470000, 4)
(470000, 1)


In [138]:
trainTestPca = pd.merge(dfX_pca4,fraud, right_index=True, left_index=True)

In [139]:
trainTestPca.shape

(470000, 5)

In [140]:
trainTestPca.head(10)

Unnamed: 0,PC0,PC1,PC2,PC3,isFraud
0,-2587585.0,-438516.0,16731.94,42054.21,1.0
1,-2575372.0,-454458.7,-2238.486,34600.5,1.0
2,-2587410.0,-434868.6,14605.32,40424.16,1.0
3,-2572303.0,-454589.6,-8860.983,31203.97,1.0
4,-2586258.0,-410799.7,572.0151,29667.7,1.0
5,-2572594.0,-407978.9,4565.545,31584.08,1.0
6,5126483.0,5552888.0,7393991.0,3074792.0,1.0
7,-2502632.0,1335916.0,-1017847.0,-750945.5,1.0
8,-479612.2,2853418.0,916082.8,78826.31,1.0
9,-2585264.0,-390046.7,-11527.98,20393.11,1.0


We check that the fraud and non-fraud counts are correct.

In [141]:
print(pd.value_counts(trainTestPca['isFraud']))

0.0    420800
1.0     49200
Name: isFraud, dtype: int64


**Saving the new dataset**  
We save the file to our S3 bucket using the **boto3** library.

In [142]:
trainTestPca.to_csv('/tmp/train_test_over_pca.csv', index = False)

In [143]:
s3.upload_file(Bucket = BUCKET_NAME, Key = 'train_test_over_pca.csv', Filename = '/tmp/train_test_over_pca.csv')

In [144]:
# Delete the file from the EC2 machine's /tmp path to free disk space.
!rm /tmp/train_test_over_pca.csv

Next, the PCA is applied to the evaluation dataframe *evaluacion*.
We read the file saved in the earlier step. This gives us a checkpoint so that the previous work does not have to be redone.

In [110]:
s3 = boto3.client("s3")

# Select the bucket we are going to work with
BUCKET_NAME = 'tfmfraud'

In [201]:
# Download the file from the S3 bucket to the EC2 machine so we can work with it.
s3.download_file(Bucket = BUCKET_NAME, Key = 'evaluacion_over.csv',Filename = '/tmp/evaluacion_over.csv')

In [111]:
# Read the file and load it into a dataframe.
evaluacion = pd.read_csv('/tmp/evaluacion_over.csv', dtype={'rank':'category'})

In [112]:
evaluacion.shape

(80000, 16)

In [113]:
evaluacion.head()

Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFlaggedFraud,day,week,isFlaggedFraud_New,ind_merchant,balanceOrig,balanceDest,hours_day,type_TRANSFER,isFraud
0,225,8318578.0,8318578.0,0.0,255858.4,8574436.0,0.0,9,1,1,0,8318578.0,8318578.0,20,0,1.0
1,700,320720.0,320720.0,0.0,1047807.0,1368527.0,0.0,29,3,1,0,320720.0,320720.0,5,0,1.0
2,146,2480947.0,2480947.0,0.0,0.0,0.0,0.0,6,1,1,0,2480947.0,0.0,13,1,1.0
3,191,36074.61,36074.61,0.0,0.0,0.0,0.0,8,1,0,0,36074.61,0.0,19,1,1.0
4,424,27279.97,27279.97,0.0,0.0,0.0,0.0,18,3,0,0,27279.97,0.0,7,1,1.0


In [115]:
print(pd.value_counts(evaluacion['isFraud']))

0.0    79200
1.0      800
Name: isFraud, dtype: int64


In [117]:
# Delete the file from the EC2 machine's /tmp path to free disk space.
!rm /tmp/evaluacion_over.csv

In [145]:
mydata2Eval = evaluacion.drop(['isFraud'], axis=1)
fraudEval = pd.DataFrame(evaluacion['isFraud'])

In [146]:
print(mydata2Eval.shape)
print(fraudEval.shape)

(80000, 15)
(80000, 1)


In [147]:
dfeval_pca = pd.DataFrame(pca.transform(mydata2Eval.values))
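
The evaluation set is projected with `pca.transform`, reusing the components fitted on the train-test data, instead of being refitted with `fit_transform`. This keeps both datasets in the same coordinate system. A toy sketch with random data (the array shapes and the seed are arbitrary assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
train = rng.normal(size=(200, 6))
evaluation = rng.normal(size=(50, 6))

pca = PCA(n_components=4)
train_pca = pca.fit_transform(train)   # learn the components on train only
eval_pca = pca.transform(evaluation)   # reuse them for the evaluation set

# transform() centers with the training mean and projects onto the fitted axes.
manual = (evaluation - pca.mean_) @ pca.components_.T
print(eval_pca.shape, np.allclose(eval_pca, manual))
```

Calling `fit_transform` on the evaluation data would learn a different set of axes, so a model trained on the train-test components would receive incompatible inputs.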

In [148]:
dfeval_pca.rename(columns={0: 'PC0'
                    , 1: 'PC1'
                    , 2: 'PC2'
                    , 3: 'PC3'}, inplace=True)
dfeval_pca.head()

Unnamed: 0,PC0,PC1,PC2,PC3
0,5207935.0,16249100.0,-169315.0,-2303496.0
1,-829777.5,67625.75,-98883.26,-61256.22
2,-2422555.0,3008502.0,-1993044.0,-1498430.0
3,-2585197.0,-388642.0,-12347.03,19765.31
4,-2585782.0,-400862.1,-5222.108,25226.53


We join the dataframe of PCA components with the corresponding isFraud values and save the result as a dataset.

In [149]:
print(dfeval_pca.shape)
print(fraudEval.shape)

(80000, 4)
(80000, 1)


In [150]:
evalPca = pd.merge(dfeval_pca,fraudEval, right_index=True, left_index=True)
evalPca.shape

(80000, 5)

We check that the fraud and non-fraud counts are correct.

In [151]:
print(pd.value_counts(evalPca['isFraud']))

0.0    79200
1.0      800
Name: isFraud, dtype: int64


**Saving the new dataset**  
We save the file to our S3 bucket using the **boto3** library.

In [152]:
evalPca.to_csv('/tmp/eval_over_pca.csv', index = False)

In [153]:
s3.upload_file(Bucket = BUCKET_NAME, Key = 'eval_over_pca.csv', Filename = '/tmp/eval_over_pca.csv')

In [154]:
# Delete the file from the EC2 machine's /tmp path to free disk space.
!rm /tmp/eval_over_pca.csv