# Step 2: Dataset cleaning

## Setting up the environment

We start by loading the required libraries:

In [1]:
import kaggle
import os
import pandas as pd
import plotly.express as px
import numpy as np
import matplotlib.pyplot as plt



We set the max rows and columns for display:

In [2]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 25)

## Loading the data

We download the dataset through the Kaggle API, which requires an API key. The download is skipped when both CSV files are already present in `data/`:

In [3]:
if not os.path.isfile("data/fraudTrain.csv") or not os.path.isfile("data/fraudTest.csv"):
    kaggle.api.authenticate()
    kaggle.api.dataset_download_files('kartik2112/fraud-detection', path='./data', unzip=True)
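Since the download depends on credentials being configured, it can help to check for them before running the cell. The following is a minimal sketch, not part of the original notebook: `kaggle_credentials_available` is a hypothetical helper, and it assumes the usual Kaggle conventions (a `kaggle.json` file in `~/.kaggle`, or the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables, with `KAGGLE_CONFIG_DIR` overriding the directory).

```python
import os
from pathlib import Path


def kaggle_credentials_available(config_dir=None, environ=None) -> bool:
    """Return True if Kaggle credentials appear to be configured.

    Hypothetical helper: it only checks the conventional locations,
    it does not validate the key against the Kaggle servers.
    """
    environ = os.environ if environ is None else environ

    # Option 1: credentials passed through environment variables.
    if environ.get("KAGGLE_USERNAME") and environ.get("KAGGLE_KEY"):
        return True

    # Option 2: a kaggle.json file in the configuration directory.
    if config_dir is None:
        config_dir = environ.get("KAGGLE_CONFIG_DIR", Path.home() / ".kaggle")
    return (Path(config_dir) / "kaggle.json").is_file()


if not kaggle_credentials_available():
    print("No Kaggle credentials found; the download cell will likely fail.")
```

The helper does not import `kaggle`, so it can run even when the credentials are missing.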

We read both datasets (train and test):

In [4]:
df_train = pd.read_csv('data/fraudTrain.csv', index_col=0)
df_test = pd.read_csv('data/fraudTest.csv', index_col=0)

## Cleaning the data

We cast the time-related columns to pandas `datetime64` values:

In [5]:
# Train
# Parsing errors are raised (the default) instead of being ignored:
# errors='ignore' is deprecated since pandas 2.2 and silently leaves the column as strings on failure.
df_train['trans_date_trans_time'] = pd.to_datetime(df_train['trans_date_trans_time'])
df_train['dob'] = pd.to_datetime(df_train['dob'])  # date of birth

# Test
df_test['trans_date_trans_time'] = pd.to_datetime(df_test['trans_date_trans_time'])
df_test['dob'] = pd.to_datetime(df_test['dob'])
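If some values might not be parseable, one option is to coerce them to `NaT` and inspect them explicitly. The sketch below uses a made-up toy series rather than the real dataset, and assumes the raw column has no missing values to begin with, so every `NaT` marks a parsing failure.

```python
import pandas as pd

# Toy data (not from the dataset): one value is deliberately malformed.
raw = pd.Series(["2019-01-01 00:00:18", "not a date", "2019-01-01 00:00:44"])

# errors="coerce" turns unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(raw, errors="coerce")

# Recover the original strings that failed to parse.
bad_rows = raw[parsed.isna()]
print(f"{len(bad_rows)} unparseable value(s): {bad_rows.tolist()}")
```

The result always has a datetime dtype, so later arithmetic on the column does not break because of a stray string.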

We drop columns that are irrelevant or redundant, since the geolocation and time information they carry is already available in other columns. This reduces the dimensionality of the dataset:

In [6]:
drop_columns = ["first","last","street","city","state",
                "zip","trans_num","unix_time"]
df_train.drop(columns=drop_columns, inplace=True)
df_test.drop(columns=drop_columns, inplace=True)
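Because the drop happens in place, re-running the cell raises a `KeyError` once the columns are gone. As a hedged alternative, the sketch below shows a drop that can be repeated safely, using the `errors="ignore"` option of `DataFrame.drop`. The toy frame and its values are invented for illustration.

```python
import pandas as pd

# Toy frame (invented values) with a subset of the real column names.
toy = pd.DataFrame({"first": ["Ann"], "zip": [28654], "amt": [4.97]})

# "unix_time" is deliberately absent from the toy frame.
drop_columns = ["first", "zip", "unix_time"]

# errors="ignore" skips labels that are not present instead of raising KeyError.
slim = toy.drop(columns=drop_columns, errors="ignore")

# Dropping a second time is a no-op, so the cell can be re-run safely.
slim_again = slim.drop(columns=drop_columns, errors="ignore")
print(list(slim_again.columns))
```

The trade-off is that a misspelled column name goes unnoticed, so the strict default remains preferable when the column list is known to be correct.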

We compute the age of each cardholder at the time of the transaction by subtracting the date of birth from the transaction timestamp and rounding to the nearest year:

In [7]:
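# One average Gregorian year (365.2425 days), the length NumPy assigns to its 'Y' unit.
# pandas 2.x no longer accepts np.timedelta64(1, 'Y') here because the unit is ambiguous.
one_year = pd.Timedelta(days=365.2425)

df_train['age'] = np.round((df_train['trans_date_trans_time'] - df_train['dob']) / one_year)
df_train = df_train.astype({'age': 'int64'})

df_test['age'] = np.round((df_test['trans_date_trans_time'] - df_test['dob']) / one_year)
df_test = df_test.astype({'age': 'int64'})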

drop_columns = ["dob"]

df_train.drop(columns=drop_columns, inplace=True)
df_test.drop(columns=drop_columns, inplace=True)
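Rounding to the nearest year reports an age one year too high for anyone whose next birthday is less than six months away. If age in completed years is preferred, a calendar-based comparison is one option. The sketch below is an alternative, not the method used in this notebook: `age_in_completed_years` is a hypothetical helper and the toy rows are invented.

```python
import pandas as pd


def age_in_completed_years(timestamps: pd.Series, dobs: pd.Series) -> pd.Series:
    """Age in completed years at each timestamp (hypothetical helper)."""
    years = timestamps.dt.year - dobs.dt.year
    # The birthday has not yet occurred in the transaction year when the
    # (month, day) of the transaction precedes the (month, day) of birth.
    before_birthday = (timestamps.dt.month < dobs.dt.month) | (
        (timestamps.dt.month == dobs.dt.month) & (timestamps.dt.day < dobs.dt.day)
    )
    return (years - before_birthday.astype("int64")).astype("int64")


# Toy rows (invented values) using the original column names.
toy = pd.DataFrame({
    "trans_date_trans_time": pd.to_datetime(["2019-01-01 00:00:18", "2019-06-30 12:00:00"]),
    "dob": pd.to_datetime(["1988-03-09", "1978-06-21"]),
})
toy["age"] = age_in_completed_years(toy["trans_date_trans_time"], toy["dob"])
print(toy)
```

For the first toy row, the cardholder is 30 on the transaction date, whereas rounding the elapsed time to the nearest year would give 31.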

We rename some columns for readability:

In [8]:
trans_dict = {"trans_date_trans_time": "timestamp", "cc_num": "credit_card_num",
              "merchant": "shop", "amt": "amount"}
df_train.rename(columns=trans_dict, inplace=True)
df_test.rename(columns=trans_dict, inplace=True)
df_train.head()  # preview the renamed columns (only the last expression of a cell is displayed)

## Saving the results

We save the cleaned datasets:

In [9]:
df_train.to_csv("data/clean_fraudTrain.csv",index=False)
df_test.to_csv("data/clean_fraudTest.csv",index=False)
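CSV files store timestamps as plain text, so the datetime dtype is lost on disk and has to be restored when the cleaned files are read back. The sketch below illustrates this round trip with an in-memory buffer and an invented toy frame. It assumes the renamed `timestamp` column from the previous step.

```python
import io

import pandas as pd

# Toy frame (invented values) mimicking a few cleaned columns.
clean = pd.DataFrame({
    "timestamp": pd.to_datetime(["2019-01-01 00:00:18", "2019-01-01 00:00:44"]),
    "amount": [4.97, 107.23],
    "age": [31, 41],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
clean.to_csv(buffer, index=False)

# Without parse_dates, the timestamp column comes back as text.
buffer.seek(0)
reloaded_plain = pd.read_csv(buffer)

# With parse_dates, the column is parsed back into datetimes.
buffer.seek(0)
reloaded = pd.read_csv(buffer, parse_dates=["timestamp"])

print(reloaded_plain["timestamp"].dtype, reloaded["timestamp"].dtype)
```

The same `parse_dates=["timestamp"]` argument would apply when loading `data/clean_fraudTrain.csv` and `data/clean_fraudTest.csv` in a later step.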