## Data

The example dataset used in this solution was originally released as part of a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection.

The dataset contains credit card transactions made by European cardholders in 2013. As is common in fraud detection, it is highly unbalanced: only 492 of the 284,807 transactions (about 0.17%) are fraudulent. All features are numerical, because the original features were transformed with PCA for confidentiality. The dataset therefore contains 28 PCA components (V1 through V28) and two untransformed features, Amount and Time. Amount is the transaction amount, and Time is the number of seconds elapsed between each transaction and the first transaction in the dataset.
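To make the imbalance concrete, the short sketch below computes the fraud rate from the two counts quoted above, together with the accuracy a trivial "always legitimate" classifier would reach. The variable names are illustrative only and are not used elsewhere in this notebook.

```python
# Counts quoted in the dataset description above.
n_fraud = 492
n_total = 284_807

# Share of transactions labelled as fraud.
fraud_rate = n_fraud / n_total

# Accuracy of a trivial model that labels every transaction as legitimate.
baseline_accuracy = 1 - fraud_rate

print(f"fraud rate: {fraud_rate:.4%}")
print(f"always-legitimate accuracy: {baseline_accuracy:.4%}")
```

Because the trivial baseline already scores so high, plain accuracy is a poor yardstick for this dataset.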

More details on current and past projects on related topics are available at https://www.researchgate.net/project/Fraud-detection-5 and on the page of the DefeatFraud project.



Let's start by downloading and reading in the credit card fraud dataset. We unzip the archive that contains the dataset and then add a couple of columns to each record: a timestamp and a unique identifier.

In [10]:
%%bash
wget https://raw.githubusercontent.com/jbrownlee/Datasets/master/creditcard.csv.zip
unzip creditcard.csv.zip

Archive:  creditcard.csv.zip
  inflating: creditcard.csv          
   creating: __MACOSX/
  inflating: __MACOSX/._creditcard.csv  


--2021-08-23 18:47:52--  https://raw.githubusercontent.com/jbrownlee/Datasets/master/creditcard.csv.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 68787176 (66M) [application/zip]
Saving to: ‘creditcard.csv.zip’


In [11]:
import pandas as pd

input_csv = pd.read_csv("creditcard.csv")

In [12]:
input_csv.head()

Unnamed: 0,0,-1.3598071336738,-0.0727811733098497,2.53634673796914,1.37815522427443,-0.338320769942518,0.462387777762292,0.239598554061257,0.0986979012610507,0.363786969611213,...,-0.018306777944153,0.277837575558899,-0.110473910188767,0.0669280749146731,0.128539358273528,-0.189114843888824,0.133558376740387,-0.0210530534538215,149.62,0.1
0,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
1,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
2,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
3,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0
4,2.0,-0.425966,0.960523,1.141109,-0.168252,0.420987,-0.029728,0.476201,0.260314,-0.568671,...,-0.208254,-0.559825,-0.026398,-0.371427,-0.232794,0.105915,0.253844,0.08108,3.67,0
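The header row in the output above consists of data values (`0`, `-1.3598071336738`, ..., `149.62`), which suggests that the CSV has no header line and that pandas consumed the first transaction as column names. A minimal sketch of one way to avoid this, assuming the column order is Time, V1 to V28, Amount, Class (the order described in the Data section); `COLUMNS` and `read_headerless` are hypothetical names introduced here, and the sample rows are synthetic:

```python
import io

import pandas as pd

# Assumed column order: Time, V1..V28, Amount, Class (31 columns in total).
COLUMNS = ["Time"] + [f"V{i}" for i in range(1, 29)] + ["Amount", "Class"]


def read_headerless(source):
    """Read a header-less transactions CSV, assigning column names explicitly."""
    # header=None stops pandas from treating the first data row as a header.
    return pd.read_csv(source, header=None, names=COLUMNS)


# Synthetic two-row sample with 31 comma-separated values per row.
row = ",".join(["0"] + ["0.5"] * 28 + ["149.62", "0"])
sample = "\n".join([row, row])

df = read_headerless(io.StringIO(sample))
print(df.shape)  # (2, 31)
```

Under that assumption, `read_headerless("creditcard.csv")` would keep the first transaction as data. The layout is worth confirming against the first few lines of the downloaded file before relying on it.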


In [10]:
import datetime as dt

# Stamp every record with the same load time and a 1-based unique identifier.
input_csv["event_time"] = dt.datetime.now()
input_csv["record_id"] = input_csv.index + 1


In [19]:
input_csv.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V23,V24,V25,V26,V27,V28,Amount,Class,event_time,record_id
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0.0,2021-08-23 10:36:46.009427,1
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0.0,2021-08-23 10:36:46.009427,2
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0.0,2021-08-23 10:36:46.009427,3
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0.0,2021-08-23 10:36:46.009427,4
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0.0,2021-08-23 10:36:46.009427,5
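As the output shows, every row carries the same `event_time`, because `dt.datetime.now()` is evaluated once and broadcast to the whole column, while `record_id` is the default integer index shifted to start at 1. A small sketch on a three-row stand-in frame (`toy` is a hypothetical name, not part of the pipeline) that checks both properties:

```python
import datetime as dt

import pandas as pd

# Three-row stand-in for input_csv, for illustration only.
toy = pd.DataFrame({"Time": [0.0, 0.0, 1.0], "Amount": [149.62, 2.69, 378.66]})

# Same two assignments as in the cell above.
toy["event_time"] = dt.datetime.now()
toy["record_id"] = toy.index + 1

# record_id should be unique; event_time should hold a single value.
ids_unique = toy["record_id"].is_unique
single_timestamp = toy["event_time"].nunique() == 1
print(ids_unique, single_timestamp)
```

Note that `record_id` is only unique here because the frame still has its default `RangeIndex`; after filtering or concatenation, the index may need to be reset first.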


Upload the file to Amazon S3 so that it can be used during ML development.

In [21]:
import sagemaker

In [20]:
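# Create a SageMaker session; without an explicit bucket, upload_data uses the session's default bucket.
sagemaker_session = sagemaker.Session()

# Write the DataFrame to a local CSV file without the index column.
input_csv.to_csv("credit-dataset.csv", index=False)

# Upload the local file to S3 under the given key prefix and show the resulting S3 URI.
inputs = sagemaker_session.upload_data(path="credit-dataset.csv", key_prefix="data/fraud-detection")
display(inputs)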
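`upload_data` returns the S3 URI of the uploaded object as a string. Some downstream tools expect the bucket and the object key separately, so the sketch below splits such a URI with the standard library. `split_s3_uri` is a hypothetical helper, and the URI is a made-up example; the real value of `inputs` depends on the account's default bucket.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3://bucket/key URI into a (bucket, key) tuple."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"not an S3 URI: {uri!r}")
    # netloc holds the bucket name; path holds the key with a leading slash.
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical URI, for illustration only.
example_uri = "s3://example-bucket/data/fraud-detection/credit-dataset.csv"
bucket, key = split_s3_uri(example_uri)
print(bucket, key)
```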