### Loading the Data

In [2]:
import pandas as pd

df = pd.read_csv('../creditcard.csv')

In [3]:
df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0.0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0.0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0.0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0.0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0.0


In [4]:
df.shape

(283171, 31)

### Exploring the Data

In [12]:
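# Check for nulls in the dataframe

# Print the total null count; a bare expression is only displayed when it is the last line of a cell
print(df.isnull().sum().sum())
# Drop any rows that contain nulls
df.dropna(inplace=True)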

The result tells us there are no nulls.

In [13]:
# Check for duplicates in the dataframe

df[df.duplicated()].shape

(0, 31)

In [14]:
# Duplicated rows that belong to the fraudulent class (single combined mask)
df[df.duplicated() & (df['Class'] == 1)].shape

(0, 31)

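We found 1081 duplicated transactions, 19 of which belong to the minority (fraudulent) class. The outputs displayed above show (0, 31), which suggests the cells were re-run on data that had already been deduplicated. We remove the duplicates below.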

In [15]:
df = df.drop_duplicates()

df[df.duplicated()].shape

(0, 31)
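
As a small illustration of the duplicate check, the sketch below uses a toy DataFrame with made-up values (not the credit card data; the names `toy`, `dup_mask` and `fraud_dup_mask` are hypothetical). It counts duplicated rows, counts how many of them are in class 1, and then drops them. Combining both conditions into one boolean mask with `&` avoids indexing an already filtered frame with a mask built on the full frame.

```python
import pandas as pd

# Toy data (hypothetical values): rows 0 and 2 are identical, as are rows 1 and 3.
toy = pd.DataFrame({
    "Amount": [10.0, 5.0, 10.0, 5.0, 7.5],
    "Class": [0, 1, 0, 1, 0],
})

# True for every repeated occurrence of a row (the first occurrence is kept as False).
dup_mask = toy.duplicated()

# Duplicated rows that are also in the minority class.
fraud_dup_mask = dup_mask & (toy["Class"] == 1)

n_duplicates = int(dup_mask.sum())
n_fraud_duplicates = int(fraud_dup_mask.sum())

# Keep only the first occurrence of each row.
deduplicated = toy.drop_duplicates()

print(n_duplicates, n_fraud_duplicates, deduplicated.shape)
```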

### Data Transformation and SMOTE

Before applying K-means, we need to transform our data into a format better suited to the model. In addition, we apply SMOTE to compensate for the under-represented minority class.
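
To give an intuition for what SMOTE does, here is a minimal sketch of its core idea: a synthetic minority sample is created by picking a minority point, picking one of its nearest minority neighbours, and interpolating at a random position between the two. This is a simplified illustration in plain Python, not the imblearn implementation; the function `smote_like_sample`, its parameters and the toy points are assumptions made for the example.

```python
import random


def smote_like_sample(minority, k=2, n_new=3, seed=42):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (simplified sketch)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base_idx = rng.randrange(len(minority))
        base = minority[base_idx]
        # All other minority points, sorted by squared Euclidean distance to base.
        others = [p for i, p in enumerate(minority) if i != base_idx]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour, in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Toy minority class: the four corners of the unit square (hypothetical data).
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like_sample(minority_points)
print(new_points)
```

Because each synthetic point lies on a segment between two existing minority points, the new samples stay inside the region already covered by the minority class.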

In [16]:
from imblearn.over_sampling import SMOTE

We split the data into X (features) and y (labels) so that we can apply SMOTE.

In [17]:
X = df.drop('Class', axis=1) 
y = df['Class']

We apply SMOTE.

In [18]:
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

In [19]:
df_resampled = pd.DataFrame(X_resampled, columns=X.columns)
df_resampled['Class'] = y_resampled

We combine the resampled features and labels into a single, larger DataFrame. Next, we standardize the Amount column to reduce the sensitivity to very large values when applying K-means.

In [20]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df_resampled["Amount"] = scaler.fit_transform(df_resampled["Amount"].values.reshape(-1,1))
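
For reference, standardization replaces each value x with (x - mean) / std. With default settings, StandardScaler uses the population standard deviation (ddof = 0). The sketch below reproduces that formula with the standard library only, using the five Amount values from the df.head() output as sample input. The results will not match the notebook's scaled column, because there the mean and standard deviation are computed over the whole resampled dataset.

```python
import statistics

# Amount values from the first five rows shown by df.head().
amounts = [149.62, 2.69, 378.66, 123.50, 69.99]

mean = statistics.fmean(amounts)
std = statistics.pstdev(amounts)  # population standard deviation (ddof = 0)

# z-score: centre on the mean, then scale to unit standard deviation.
standardized = [(a - mean) / std for a in amounts]

print(standardized)
```

After this transformation the sample has mean 0 and standard deviation 1, so a single large amount no longer dominates distance computations.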


Viewing the processed data.

In [21]:
df_resampled

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.000000,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,0.239370,0.0
1,0.000000,1.191857,0.266151,0.166480,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.167170,0.125895,-0.008983,0.014724,-0.424901,0.0
2,1.000000,-1.358354,-1.340163,1.773209,0.379780,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,1.274860,0.0
3,1.000000,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.108300,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,0.121281,0.0
4,2.000000,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.206010,0.502292,0.219422,0.215153,-0.120637,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
563235,154021.180123,-6.151978,-1.487382,-4.893660,2.896103,0.463398,-3.166797,-4.169978,1.524096,-1.890787,...,1.171334,-0.994378,0.313069,-0.243100,-0.165280,1.236037,-0.669807,-0.585297,0.261710,1.0
563236,48359.577927,-2.649776,-1.334328,0.885839,1.926523,-0.141272,-0.539475,0.653850,-0.317152,0.519650,...,-0.357123,0.414452,0.741434,0.420443,0.258677,-0.133946,0.081950,0.354275,0.463942,1.0
563237,41202.923605,-8.372807,6.105246,-11.622985,6.699396,-8.100926,-3.714798,-12.366592,5.574975,-6.222133,...,2.188050,-0.279497,0.009196,0.403336,-0.020379,0.523873,0.819582,0.075949,-0.303293,1.0
563238,41567.118936,-1.956009,1.740479,-1.227773,2.789663,-1.863577,-0.246224,-2.990720,1.266762,-1.507698,...,0.677427,0.529681,-0.139876,0.151106,0.074212,-0.161577,0.386903,0.019372,-0.260275,1.0


In [22]:
df_resampled.to_csv('creditcard_treated.csv', index=False)

We export the data to a CSV file for later use.