# Data Cleaning

## Objectives
* Remove any irrelevant features
* Check for and deal with null values
* Check for and deal with ridiculously high or low values
* Make sure all the data types are correct

I will start off by importing the necessary libraries.

In [13]:
# Import necessary libraries
import numpy as np # for np.nan
import pandas as pd # dataframe
import datetime # time
import pickle

Now I will load the pickled dataframe and preview it.

## Pandas & Pickle

In [14]:
# Load dataframe and preview it
# A context manager closes the file even if loading fails
with open("creditcard.pickle", "rb") as pickle_in:
    df = pickle.load(pickle_in)

df.head()

,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


#### Remove any irrelevant features

There are no irrelevant features.

#### Check for and deal with null values

Now I will check how many null values are in the dataframe.

In [15]:
# Find the count of null values in the dataframe
df.isna().sum().sum()

0

Excellent, there are no null values.
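Had the total been non-zero, the next step would be to find out which columns are affected. The following is a minimal sketch of that per-column breakdown on a small made-up frame; `toy` and its values are hypothetical and only stand in for the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the credit card data
toy = pd.DataFrame({
    "Time": [0.0, 1.0, 2.0],
    "Amount": [149.62, np.nan, 378.66],
    "Class": [0, 0, 1],
})

# Null count per column
null_counts = toy.isna().sum()

# Keep only the columns that actually contain nulls
columns_with_nulls = null_counts[null_counts > 0]
print(columns_with_nulls)

# Grand total, equivalent to toy.isna().sum().sum()
total_nulls = int(null_counts.sum())
print(total_nulls)
```

On the real dataframe the same two lines would simply return an empty series and a total of zero.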

#### Check for and deal with ridiculously high or low values

To be clear, I am not looking for outliers at this stage, just disproportionate values that might be caused by errors in data entry. Typically these will be all nines. I will check for these now by looking at each column's minimum and maximum values.

In [16]:
# Check each column's minimum values
df.min()

Time        0.000000
V1        -56.407510
V2        -72.715728
V3        -48.325589
V4         -5.683171
V5       -113.743307
V6        -26.160506
V7        -43.557242
V8        -73.216718
V9        -13.434066
V10       -24.588262
V11        -4.797473
V12       -18.683715
V13        -5.791881
V14       -19.214325
V15        -4.498945
V16       -14.129855
V17       -25.162799
V18        -9.498746
V19        -7.213527
V20       -54.497720
V21       -34.830382
V22       -10.933144
V23       -44.807735
V24        -2.836627
V25       -10.295397
V26        -2.604551
V27       -22.565679
V28       -15.430084
Amount      0.000000
Class       0.000000
dtype: float64

In [17]:
# Check each column's maximum values
df.max()

Time      172792.000000
V1             2.454930
V2            22.057729
V3             9.382558
V4            16.875344
V5            34.801666
V6            73.301626
V7           120.589494
V8            20.007208
V9            15.594995
V10           23.745136
V11           12.018913
V12            7.848392
V13            7.126883
V14           10.526766
V15            8.877742
V16           17.315112
V17            9.253526
V18            5.041069
V19            5.591971
V20           39.420904
V21           27.202839
V22           10.503090
V23           22.528412
V24            4.584549
V25            7.519589
V26            3.517346
V27           31.612198
V28           33.847808
Amount     25691.160000
Class          1.000000
dtype: float64

There appear to be no data entry errors.
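Reading the minimum and maximum by eye works for about thirty columns, but the all-nines check can also be automated. This is a rough sketch under my own assumptions: the `SENTINELS` list and the `count_sentinels` helper are hypothetical, and the toy values are made up to show a hit.

```python
import pandas as pd

# Hypothetical placeholder values; adjust to whatever the data source uses
SENTINELS = [999, 9999, 99999, 999999]


def count_sentinels(frame, sentinels=SENTINELS):
    """Count, per numeric column, the values whose absolute value is a sentinel."""
    numeric = frame.select_dtypes(include="number")
    return numeric.abs().isin(sentinels).sum()


# Made-up frame with two suspicious amounts
toy = pd.DataFrame({
    "Amount": [149.62, 9999.0, 2.69, -999.0],
    "Class": [0, 0, 1, 0],
})

sentinel_counts = count_sentinels(toy)
print(sentinel_counts[sentinel_counts > 0])
```

An exact-match check like this only catches the placeholder values listed, so it complements the minimum and maximum inspection rather than replacing it.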

#### Check for duplicates

In [18]:
df.Class.value_counts(normalize=True)

0    0.998273
1    0.001727
Name: Class, dtype: float64

In [19]:
df.duplicated().sum()

1081

In [20]:
# Drop duplicates
df = df.drop_duplicates()

In [21]:
df.Class.value_counts(normalize=True)

0    0.998333
1    0.001667
Name: Class, dtype: float64

In [22]:
df.duplicated(df.columns[:-1]).sum()

0
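The last check passes all columns except `Class` as the subset, so it looks for rows that share every feature value but could still differ in their label. A small sketch on made-up data shows the difference between the two calls; the `toy` frame is hypothetical.

```python
import pandas as pd

# Made-up frame: rows 0 and 1 are exact duplicates,
# rows 2 and 3 share their features but carry different labels
toy = pd.DataFrame({
    "V1": [0.1, 0.1, 0.2, 0.2],
    "Amount": [10.0, 10.0, 5.0, 5.0],
    "Class": [0, 0, 0, 1],
})

# Exact duplicates across every column, label included
exact_dupes = int(toy.duplicated().sum())

# Drop them, keeping the first occurrence
deduped = toy.drop_duplicates()

# Duplicates over the feature columns only (everything except the last column)
feature_cols = list(deduped.columns[:-1])
conflicting = int(deduped.duplicated(subset=feature_cols).sum())

print(exact_dupes, len(deduped), conflicting)
```

A non-zero result from the second check would mean identical transactions labelled both ways, which `drop_duplicates()` alone does not remove.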

#### Make sure all the data types are correct

Now I will check the dataframe's data types.

In [23]:
# Check dataframe's data types
df.dtypes

Time      float64
V1        float64
V2        float64
V3        float64
V4        float64
V5        float64
V6        float64
V7        float64
V8        float64
V9        float64
V10       float64
V11       float64
V12       float64
V13       float64
V14       float64
V15       float64
V16       float64
V17       float64
V18       float64
V19       float64
V20       float64
V21       float64
V22       float64
V23       float64
V24       float64
V25       float64
V26       float64
V27       float64
V28       float64
Amount    float64
Class       int64
dtype: object

I would normally change the time feature to datetime format, but that does not seem practical here.

All the data types are correct.
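If a time-aware representation were ever needed, one option that avoids inventing a calendar date is a timedelta. This is only a sketch, assuming `Time` holds seconds elapsed since the first transaction, as the Kaggle dataset description states; the `toy` frame is hypothetical, apart from reusing the maximum `Time` value shown above.

```python
import pandas as pd

# Hypothetical sample of elapsed-seconds values
toy = pd.DataFrame({"Time": [0.0, 1.0, 3600.0, 172792.0]})

# Convert elapsed seconds to Timedelta; no absolute start date is assumed
elapsed = pd.to_timedelta(toy["Time"], unit="s")

# Hour component relative to the (unknown) start of recording
hour_offset = elapsed.dt.components["hours"]

print(elapsed)
print(hour_offset.tolist())
```

Because the start time is unknown, `hour_offset` is an offset from the first transaction, not a true hour of day.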

#### Save data

In [24]:
# A context manager closes the file even if dumping fails
with open("clean_data.pickle", "wb") as pickle_out:
    pickle.dump(df, pickle_out)
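A quick way to confirm the save worked is to read the file back and compare it with the original. The sketch below does this on a made-up frame in a temporary directory so that it does not touch the real file; `toy` and the temporary path are hypothetical.

```python
import os
import pickle
import tempfile

import pandas as pd

# Hypothetical toy frame standing in for the cleaned data
toy = pd.DataFrame({"Amount": [149.62, 2.69], "Class": [0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "clean_data.pickle")

    # Write the frame out
    with open(path, "wb") as pickle_out:
        pickle.dump(toy, pickle_out)

    # Read it straight back in
    with open(path, "rb") as pickle_in:
        reloaded = pickle.load(pickle_in)

# True when values, dtypes and index all match
round_trip_ok = reloaded.equals(toy)
print(round_trip_ok)
```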

## Sources
1: The Kaggle post from which I obtained this dataset: https://www.kaggle.com/mlg-ulb/creditcardfraudime