# Analysis: Abstract data set for Credit card fraud detection

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
import sklearn as sl

## Dataset loading and data summary

The dataset~\citep{Joshi_2018} is loaded into the dataframe that will be used throughout the analysis.

In [2]:
df = pd.read_csv("./ds/creditcardcsvpresent.csv")

This dataframe contains twelve columns. The first two will be dropped: `Merchant_id` serves only as a data index, and `Transaction date` is completely empty and therefore unrecoverable.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3075 entries, 0 to 3074
Data columns (total 12 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Merchant_id                     3075 non-null   int64  
 1   Transaction date                0 non-null      float64
 2   Average Amount/transaction/day  3075 non-null   float64
 3   Transaction_amount              3075 non-null   float64
 4   Is declined                     3075 non-null   object 
 5   Total Number of declines/day    3075 non-null   int64  
 6   isForeignTransaction            3075 non-null   object 
 7   isHighRiskCountry               3075 non-null   object 
 8   Daily_chargeback_avg_amt        3075 non-null   int64  
 9   6_month_avg_chbk_amt            3075 non-null   float64
 10  6-month_chbk_freq               3075 non-null   int64  
 11  isFradulent                     3075 non-null   object 
dtypes: float64(4), int64(4), object(4)
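
The empty column can also be found programmatically instead of being read off the `info()` output. The following is a minimal sketch, assuming a small hypothetical frame `toy` in place of the real CSV (the column names are borrowed from the dataset, the values are invented):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real CSV; values are invented for illustration.
toy = pd.DataFrame({
    "Merchant_id": [101, 102, 103],
    "Transaction date": [np.nan, np.nan, np.nan],
    "Transaction_amount": [100.0, 4823.0, 5008.5],
})

# isna().all() is True only for columns in which every value is missing.
empty_cols = toy.columns[toy.isna().all()].tolist()
print(empty_cols)
```

On the real dataframe the same expression, applied to `df`, should flag `Transaction date`, which matches the `0 non-null` entry reported by `info()`.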

### Dropping columns

First, the target column `isFradulent` (spelled this way in the dataset) must be saved in a separate variable, because it will be removed from the working dataframe: the analysis uses unsupervised methods.

In [4]:
ideal_results = df["isFradulent"]
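
As an alternative to copying the target and dropping it later, `DataFrame.pop` stores the column and removes it from the frame in one step. This is a sketch of that option, not what the notebook does; the frame `toy` and its values are hypothetical:

```python
import pandas as pd

# Hypothetical miniature of the dataset; values are invented.
toy = pd.DataFrame({
    "Transaction_amount": [100.0, 4823.0, 5008.5, 26000.0],
    "isFradulent": ["N", "N", "Y", "Y"],
})

# pop() returns the column as a Series and deletes it from the frame in place.
target = toy.pop("isFradulent")

print(list(toy.columns))
print(target.tolist())
```

Keeping the target outside the feature frame makes it harder to feed it to an unsupervised method by accident, while still leaving it available for evaluating the results afterwards.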

Now the columns that are not needed for the analysis can be dropped.

In [5]:
df = df.drop(["Merchant_id", "Transaction date"], axis=1)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3075 entries, 0 to 3074
Data columns (total 10 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Average Amount/transaction/day  3075 non-null   float64
 1   Transaction_amount              3075 non-null   float64
 2   Is declined                     3075 non-null   object 
 3   Total Number of declines/day    3075 non-null   int64  
 4   isForeignTransaction            3075 non-null   object 
 5   isHighRiskCountry               3075 non-null   object 
 6   Daily_chargeback_avg_amt        3075 non-null   int64  
 7   6_month_avg_chbk_amt            3075 non-null   float64
 8   6-month_chbk_freq               3075 non-null   int64  
 9   isFradulent                     3075 non-null   object 
dtypes: float64(3), int64(3), object(4)
memory usage: 240.4+ KB


After this step the dataframe has ten columns: nine features plus the target `isFradulent`, which is still present at this point. The original source does not describe the columns, so their meaning can only be inferred, which may of course constrain the discussion that follows from the analysis. This underlines the importance of providing metadata for any computable data collection: text, audio, video, datasets, and so on.

Six of the feature columns are numeric and the other three are categorical (the target is categorical as well); their general description is shown below.

In [7]:
df.describe().transpose()

| Column | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Average Amount/transaction/day | 3075.0 | 515.026556 | 291.906978 | 4.011527 | 269.788047 | 502.549575 | 765.272803 | 2000.0 |
| Transaction_amount | 3075.0 | 9876.39921 | 10135.331016 | 0.0 | 2408.781147 | 6698.891856 | 14422.568935 | 108000.0 |
| Total Number of declines/day | 3075.0 | 0.957398 | 2.192391 | 0.0 | 0.0 | 0.0 | 0.0 | 20.0 |
| Daily_chargeback_avg_amt | 3075.0 | 55.737561 | 206.634779 | 0.0 | 0.0 | 0.0 | 0.0 | 998.0 |
| 6_month_avg_chbk_amt | 3075.0 | 40.022407 | 155.96884 | 0.0 | 0.0 | 0.0 | 0.0 | 998.0 |
| 6-month_chbk_freq | 3075.0 | 0.39187 | 1.548479 | 0.0 | 0.0 | 0.0 | 0.0 | 9.0 |


In [8]:
df.describe(include='object').transpose()

| Column | count | unique | top | freq |
|---|---|---|---|---|
| Is declined | 3075 | 2 | N | 3018 |
| isForeignTransaction | 3075 | 2 | N | 2369 |
| isHighRiskCountry | 3075 | 2 | N | 2870 |
| isFradulent | 3075 | 2 | N | 2627 |


In [9]:
for o in ["Is declined", "isForeignTransaction","isHighRiskCountry"]:
    print("-----")
    print(df[o].value_counts())

-----
N    3018
Y      57
Name: Is declined, dtype: int64
-----
N    2369
Y     706
Name: isForeignTransaction, dtype: int64
-----
N    2870
Y     205
Name: isHighRiskCountry, dtype: int64


### Handling categorical variables

Separate variables are created so that no categorical variables are used directly. The categorical variable `Is declined`, which takes the values `Y` or `N` in `df["Is declined"]`, can be replaced by two boolean dummy variables, `Is_declined_Y` and `Is_declined_N`.

In [10]:
dummy_Is_declined = pd.get_dummies(df["Is declined"], prefix="Is_declined")
dummy_Is_declined.tail()

| | Is_declined_N | Is_declined_Y |
|---|---|---|
| 3070 | 0 | 1 |
| 3071 | 0 | 1 |
| 3072 | 0 | 1 |
| 3073 | 0 | 1 |
| 3074 | 0 | 1 |


Now the original variable is dropped and the new variables are appended to the dataframe.

In [None]:
df = df.drop(["Is declined"], axis=1)  # positional argument must come before the keyword argument
df = pd.concat([df,dummy_Is_declined], axis = 1)
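
The same encode, drop and concatenate sequence can be applied to several categorical columns in a single `pd.get_dummies` call by passing the frame itself together with `columns=`. This is a sketch on a hypothetical frame `toy` with invented values; the `prefix` list is chosen here to match the `Is_declined` naming used above:

```python
import pandas as pd

# Hypothetical miniature of the working dataframe; values are invented.
toy = pd.DataFrame({
    "Transaction_amount": [100.0, 4823.0, 5008.5],
    "Is declined": ["N", "Y", "N"],
    "isForeignTransaction": ["Y", "N", "N"],
    "isHighRiskCountry": ["N", "N", "Y"],
})

categorical = ["Is declined", "isForeignTransaction", "isHighRiskCountry"]

# The listed columns are replaced by their dummies; the rest are kept unchanged.
# dtype=int requests 0/1 output, since recent pandas versions return booleans by default.
encoded = pd.get_dummies(
    toy,
    columns=categorical,
    prefix=["Is_declined", "isForeignTransaction", "isHighRiskCountry"],
    dtype=int,
)
print(encoded.columns.tolist())
```

With two-level variables the `_N` and `_Y` columns are exact complements of each other, so passing `drop_first=True` is an option when redundant columns are unwanted.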