## Data

The example dataset used in this solution was originally released as part of a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection.

The dataset contains credit card transactions made by European cardholders in 2013. As is common in fraud detection, it is highly imbalanced: only 492 of the 284,807 transactions are fraudulent. All features are numerical, because the original features were transformed with PCA for confidentiality. As a result, the dataset contains 28 PCA components plus two features that were not transformed, Amount and Time. Amount is the transaction amount, and Time is the number of seconds elapsed between each transaction and the first transaction in the dataset.
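To make the imbalance concrete, the short sketch below works out the fraud rate from the two counts quoted above. It also shows why plain accuracy is a misleading metric here: a model that never flags fraud would still be right almost all of the time. The variable names are illustrative only.

```python
# Counts quoted in the dataset description.
n_fraud = 492
n_total = 284_807

# Share of transactions that are fraudulent.
fraud_rate = n_fraud / n_total

# Accuracy of a trivial model that always predicts "not fraud".
baseline_accuracy = 1 - fraud_rate

print(f"fraud rate: {fraud_rate:.4%}")
print(f"always-negative accuracy: {baseline_accuracy:.4%}")
```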

More details on current and past projects on related topics are available at https://www.researchgate.net/project/Fraud-detection-5 and on the page of the DefeatFraud project.



Let's start by downloading the credit card fraud dataset and unzipping the archive that contains it. Later, we add a couple of columns to each record: a timestamp and a unique identifier.

In [2]:
%%bash
wget https://raw.githubusercontent.com/jbrownlee/Datasets/master/creditcard.csv.zip
unzip creditcard.csv.zip

Archive:  creditcard.csv.zip
  inflating: creditcard.csv          
   creating: __MACOSX/
  inflating: __MACOSX/._creditcard.csv  


--2021-11-24 05:30:40--  https://raw.githubusercontent.com/jbrownlee/Datasets/master/creditcard.csv.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 68787176 (66M) [application/zip]
Saving to: ‘creditcard.csv.zip’

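The unzip output above shows that the archive also carries a `__MACOSX/` metadata folder. If you would rather extract only the CSV, something along these lines may work. It is a minimal sketch using the standard-library `zipfile` module; `extract_csv_only` is a hypothetical helper name, not part of the original notebook.

```python
import zipfile
from pathlib import Path


def extract_csv_only(zip_path, dest_dir="."):
    """Extract only .csv members, skipping macOS metadata entries."""
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            # Skip the __MACOSX/ folder and its "._" companion files.
            if name.startswith("__MACOSX/") or Path(name).name.startswith("._"):
                continue
            if name.lower().endswith(".csv"):
                archive.extract(name, dest_dir)
                extracted.append(name)
    return extracted


# Example call, assuming the archive has already been downloaded:
# extract_csv_only("creditcard.csv.zip")
```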

In [3]:
import pandas as pd

# Column labels: time, the 28 PCA components v1..v28, amount, and the class label.
column_names = ['time'] + [f'v{i}' for i in range(1, 29)] + ['amount', 'class']

input_csv = pd.read_csv("creditcard.csv", names=column_names)
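Because `names=` is passed without `header=`, pandas treats every line of the file as data. If a copy of the file did include a header line, that line would be read as a row and the numeric columns would no longer be numeric. The sketch below shows a few checks one might run after loading. `check_transactions` and `EXPECTED_COLUMNS` are hypothetical names, and the two-row frame is synthetic stand-in data, not the real dataset.

```python
import io

import pandas as pd

EXPECTED_COLUMNS = ["time"] + [f"v{i}" for i in range(1, 29)] + ["amount", "class"]


def check_transactions(df):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if list(df.columns) != EXPECTED_COLUMNS:
        problems.append("unexpected column names or order")
    # A header line read as data would make these columns non-numeric.
    non_numeric = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]
    if non_numeric:
        problems.append(f"non-numeric columns: {non_numeric}")
    if "class" in df.columns and not df["class"].isin([0, 1]).all():
        problems.append("class contains values other than 0 and 1")
    if df.isna().any().any():
        problems.append("missing values present")
    return problems


# Tiny synthetic stand-in for creditcard.csv (two rows, no header line).
row = ",".join(["0.0"] * 30)
sample = pd.read_csv(io.StringIO(f"{row},0\n{row},1\n"), names=EXPECTED_COLUMNS)
print(check_transactions(sample))
```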

In [17]:
input_csv = input_csv.sample(n=100000)
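A plain random sample of a dataset this imbalanced can drift away from the original fraud ratio, and it changes on every run unless a seed is set. One possible alternative, shown here as a sketch rather than what the notebook does, is to sample each class separately with a fixed `random_state`. `stratified_sample` is a hypothetical helper, and the demo frame is synthetic.

```python
import pandas as pd


def stratified_sample(df, n, label_col="class", random_state=42):
    """Sample about n rows while keeping each class's share of the data."""
    frac = n / len(df)
    # Sampling the same fraction from every class preserves the class ratio.
    return df.groupby(label_col, group_keys=False).sample(
        frac=frac, random_state=random_state
    )


# Synthetic demo: 990 legitimate rows and 10 fraudulent ones.
demo = pd.DataFrame({"amount": range(1000), "class": [0] * 990 + [1] * 10})
subset = stratified_sample(demo, n=100)
print(subset["class"].value_counts())
```

Per-class sizes are rounded, so the result can differ from `n` by a row or two on real data.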

In [18]:
input_csv.head()

Unnamed: 0,time,v1,v2,v3,v4,v5,v6,v7,v8,v9,...,v23,v24,v25,v26,v27,v28,amount,class,event_time,record_id
112044,72488.0,1.278061,0.480101,-0.11858,1.110239,0.007263,-1.202533,0.528901,-0.336875,-0.228962,...,-0.140618,0.397477,0.841722,-0.295285,-0.009945,0.010135,0.89,0,2021-11-24 05:32:49.554232,112045
256771,157851.0,0.133842,0.913407,-0.586069,-0.752408,1.09917,-0.279002,0.812983,0.111391,-0.149919,...,0.041623,0.067257,-0.42072,0.124742,0.212057,0.067958,4.56,0,2021-11-24 05:32:49.554232,256772
113264,72998.0,1.136992,0.106427,0.28939,0.892783,-0.073193,0.074396,-0.12131,0.16947,-0.044013,...,-0.046173,-0.339419,0.444799,-0.404998,0.025261,0.009647,20.98,0,2021-11-24 05:32:49.554232,113265
149565,91472.0,0.116716,0.633029,1.205616,0.007579,-0.047984,-0.306298,0.226138,-0.134629,1.841086,...,0.013697,-0.013387,-1.212608,-0.381238,0.310468,0.277677,12.99,0,2021-11-24 05:32:49.554232,149566
222774,143111.0,0.200281,-0.553021,1.272415,-2.587289,-0.874545,-1.546157,0.182216,-0.54588,-2.002891,...,-0.04077,0.928512,0.011134,-0.338252,-0.183966,-0.265404,15.0,0,2021-11-24 05:32:49.554232,222775


Add two fields that are required when using a feature store: one to track the record's version (event time) and one to uniquely identify each record.

In [5]:
import datetime as dt

# Use one timestamp for every row, and derive the identifier from the row index.
input_csv["event_time"] = dt.datetime.now()
input_csv["record_id"] = input_csv.index + 1
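Some feature stores expect the event time as a number or an ISO-8601 string rather than a Python `datetime`, so it is worth checking the documentation of the one you use. The sketch below assumes fractional Unix epoch seconds are acceptable and uses a timezone-aware UTC timestamp. It also confirms that the derived identifier is unique, which holds after sampling as long as the index itself is unique. The three-row frame is synthetic and only borrows index values from the output above.

```python
import datetime as dt

import pandas as pd

# Synthetic frame with a non-contiguous index, as produced by sampling.
df = pd.DataFrame({"amount": [0.89, 4.56, 20.98]}, index=[112044, 256771, 113264])

# One timezone-aware timestamp shared by all rows, stored as epoch seconds.
now_utc = dt.datetime.now(dt.timezone.utc)
df["event_time"] = now_utc.timestamp()

# index + 1 is unique whenever the index is unique.
df["record_id"] = df.index + 1
if not df["record_id"].is_unique:
    raise ValueError("record_id must be unique for every row")

print(df.dtypes)
```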


In [6]:
input_csv.head()

Unnamed: 0,time,v1,v2,v3,v4,v5,v6,v7,v8,v9,...,v23,v24,v25,v26,v27,v28,amount,class,event_time,record_id
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0,2021-11-24 05:32:49.554232,1
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0,2021-11-24 05:32:49.554232,2
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0,2021-11-24 05:32:49.554232,3
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0,2021-11-24 05:32:49.554232,4
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0,2021-11-24 05:32:49.554232,5


Upload the file to S3 so that it can be used during ML development.

In [7]:
import sagemaker

In [8]:
sagemaker_session = sagemaker.Session()

# Write the DataFrame to a local CSV file without the index column.
input_csv.to_csv('credit-dataset.csv', index=False)

# Upload the file to the session's default bucket under the given key prefix.
inputs = sagemaker_session.upload_data(path='credit-dataset.csv', key_prefix='data/fraud-detection')
display(inputs)

's3://sagemaker-us-east-1-365792799466/data/fraud-detection/credit-dataset.csv'
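Later steps often need the bucket and the object key separately rather than the full URI. A minimal sketch for splitting the returned value with the standard library follows. `split_s3_uri` is a hypothetical helper, and the URI is the one printed above.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The bucket is the network location; the key is the path without its leading slash.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_uri(
    "s3://sagemaker-us-east-1-365792799466/data/fraud-detection/credit-dataset.csv"
)
print(bucket)
print(key)
```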

The training dataset has now been uploaded to S3.