## Fraud Detection with Python - DataCamp

Date: 04/30/2019

Author: Long Nguyen



### Chapter 1: Preparing your data

In this chapter, I explore the dataset provided by the course using pandas.

In [3]:
import pandas as pd
# Dataset will be located in chapter1 folder.
df = pd.read_csv('chapter_1/creditcard_sampledata.csv', index_col=0)
df.info() # getting some information about the dataset

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8000 entries, 0 to 7999
Data columns (total 31 columns):
Time      8000 non-null int64
V1        8000 non-null float64
V2        8000 non-null float64
V3        8000 non-null float64
V4        8000 non-null float64
V5        8000 non-null float64
V6        8000 non-null float64
V7        8000 non-null float64
V8        8000 non-null float64
V9        8000 non-null float64
V10       8000 non-null float64
V11       8000 non-null float64
V12       8000 non-null float64
V13       8000 non-null float64
V14       8000 non-null float64
V15       8000 non-null float64
V16       8000 non-null float64
V17       8000 non-null float64
V18       8000 non-null float64
V19       8000 non-null float64
V20       8000 non-null float64
V21       8000 non-null float64
V22       8000 non-null float64
V23       8000 non-null float64
V24       8000 non-null float64
V25       8000 non-null float64
V26       8000 non-null float64
V27       8000 non-null float64

In [4]:
df.head()

,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,64,1.212511,-0.099054,-1.192094,0.286324,2.160516,3.616314,-0.404207,0.842331,0.16936,...,-0.167496,-0.494695,-0.149785,1.011227,0.883548,-0.329434,0.02037,0.017037,34.7,0
1,64,-0.658305,0.406791,2.037461,-0.291298,0.14791,-0.350857,0.945373,-0.17256,0.025133,...,-0.156096,-0.238805,0.089877,0.421195,-0.352487,0.074783,-0.094192,-0.092493,54.99,0
2,124,1.105253,0.541842,0.839421,2.570933,-0.279517,-0.423062,0.088019,0.011622,-0.715756,...,-0.137434,-0.460991,0.189397,0.556329,0.185786,-0.18903,0.000208,0.026167,6.24,0
3,128,1.239495,-0.182609,0.155058,-0.928892,-0.746227,-1.235608,-0.061695,-0.125223,0.984938,...,0.146077,0.481119,-0.140019,0.538261,0.71072,-0.621382,0.036867,0.010963,8.8,0
4,132,-1.571359,1.687508,0.73467,1.29335,-0.217532,-0.002677,0.147364,0.515362,-0.372442,...,0.048549,0.377256,-0.030436,0.117608,-0.06052,-0.29655,-0.48157,-0.167897,10.0,0


The dataset contains 8,000 rows and 31 columns.

In [5]:
# Count the occurrences of fraud vs. no fraud
occ = df['Class'].value_counts()
occ

0    7983
1      17
Name: Class, dtype: int64

In [7]:
occ / len(df) * 100  # percentage of each class

0    99.7875
1     0.2125
Name: Class, dtype: float64
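The same percentages can also be obtained in one step with the `normalize` argument of `value_counts`. The sketch below uses a small made-up Series (`toy_labels` is a hypothetical name, not part of the course data) so that it runs on its own.

```python
import pandas as pd

# Hypothetical toy labels: three non-fraud cases (0) and one fraud case (1).
toy_labels = pd.Series([0, 0, 0, 1], name="Class")

# normalize=True returns fractions instead of counts; multiply by 100 for percentages.
toy_pct = toy_labels.value_counts(normalize=True) * 100
print(toy_pct)
```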

In [9]:
# Import matplotlib
import matplotlib.pyplot as plt

# Define a function to create a scatter plot of our data and labels.
# X is a 2-D NumPy array of features, y is a 1-D array of 0/1 class labels.
def plot_data(X, y):
    # Plot the first two feature columns, one series per class
    plt.scatter(X[y == 0, 0], X[y == 0, 1], label="Class #0", alpha=0.5, linewidth=0.15)
    plt.scatter(X[y == 1, 0], X[y == 1, 1], label="Class #1", alpha=0.5, linewidth=0.15, c='r')
    plt.legend()
    plt.show()
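Note that `plot_data` indexes its inputs with expressions such as `X[y == 0, 0]`, which is NumPy-style indexing, so the DataFrame has to be converted to arrays first. The following is a minimal sketch of such a conversion. The helper name `prep_data` and the toy DataFrame are assumptions made for illustration; the notebook itself does not show how `X` and `y` were built.

```python
import pandas as pd

def prep_data(df):
    """Hypothetical helper: split a DataFrame into a feature array X and a label array y."""
    # Every column except 'Class' is treated as a feature (an assumption of this sketch).
    X = df.drop(columns=["Class"]).to_numpy(dtype=float)
    y = df["Class"].to_numpy()
    return X, y

# Made-up rows that only mimic the layout of the credit card data.
toy = pd.DataFrame({
    "V1": [0.1, -3.2, 0.4],
    "V2": [1.0, 2.5, -0.3],
    "Class": [0, 1, 0],
})

X, y = prep_data(toy)
print(X.shape)       # features as a 2-D array
print(X[y == 1, 0])  # first feature of the rows labelled as fraud
```

With arrays in this shape, `plot_data(X, y)` would plot the first two feature columns, colored by class.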

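### Introducing sampling techniques such as undersampling, oversampling, and SMOTE

* Undersampling - randomly sample from the majority class (non-fraud) so that its size equals the number of fraud cases.
* Oversampling - duplicate fraud cases, by sampling them with replacement, until their number equals the number of non-fraud cases.
* SMOTE (Synthetic Minority Over-sampling Technique) - create synthetic fraud cases by interpolating between an existing fraud case and its nearest fraud neighbors, instead of duplicating rows.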
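Random undersampling and oversampling can be sketched with plain pandas. The example below is an illustration under stated assumptions, not the course's own code: the `toy` DataFrame and its `Amount` values are made up, and `random_state=0` is chosen only to make the run repeatable.

```python
import pandas as pd

# Hypothetical toy data: 20 non-fraud rows (Class 0) and 4 fraud rows (Class 1).
toy = pd.DataFrame({
    "Amount": range(24),
    "Class": [0] * 20 + [1] * 4,
})

fraud = toy[toy["Class"] == 1]
non_fraud = toy[toy["Class"] == 0]

# Undersampling: keep a random subset of the majority class,
# the same size as the minority class.
undersampled = pd.concat([
    non_fraud.sample(n=len(fraud), random_state=0),
    fraud,
])

# Oversampling: draw minority rows with replacement
# until they match the size of the majority class.
oversampled = pd.concat([
    non_fraud,
    fraud.sample(n=len(non_fraud), replace=True, random_state=0),
])

print(undersampled["Class"].value_counts())
print(oversampled["Class"].value_counts())
```

Undersampling discards most of the non-fraud rows, while oversampling repeats the same few fraud rows many times, so either choice trades off information loss against duplication.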

In [12]:
# Examine the mean of each feature using groupby('Class')
import numpy as np
df.groupby('Class').mean()

Class,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,95383.605161,0.023553,-0.008543,-0.002408,-0.034757,-0.020108,0.003264,0.023668,-0.00472,-0.01066,...,-0.00692,-0.004279,0.027999,0.006689,-0.00055,-0.000616,0.000659,-0.00674,0.005625,89.721167
1,83000.176471,-3.235382,1.317054,-3.762234,2.80114,-0.941354,-1.184692,-3.527826,-0.108892,-1.546536,...,-0.29245,0.361582,0.15921,-0.119366,-0.144082,0.155552,0.038765,0.265996,0.131853,65.287647


In [13]:
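# Simple rule: flag a transaction as fraud when V1 < -3 and V3 < -5.
# The rule is suggested by the much lower fraud-class means of V1 and V3 above.
df['flag_as_fraud'] = np.where(np.logical_and(df['V1'] < -3, df['V3'] < -5), 1, 0)
# Compare the flagged cases with the actual labels
print(pd.crosstab(df.Class, df.flag_as_fraud, rownames=['Actual Fraud'], colnames=['Flagged Fraud']))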

Flagged Fraud     0   1
Actual Fraud           
0              7949  34
1                13   4


*The crosstab function is similar to table() in R.
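One way to read the crosstab output is to turn its four counts into precision and recall for the fraud class. The short calculation below copies the counts from the table; the variable names are my own, and the choice of these two metrics is an interpretation added here rather than something the notebook computes.

```python
# Counts copied from the crosstab output.
true_negatives = 7949   # actual non-fraud, not flagged
false_positives = 34    # actual non-fraud, flagged as fraud
false_negatives = 13    # actual fraud, not flagged
true_positives = 4      # actual fraud, flagged as fraud

total = true_negatives + false_positives + false_negatives + true_positives

# Precision: share of flagged transactions that are really fraud.
precision = true_positives / (true_positives + false_positives)
# Recall: share of actual fraud cases that the rule catches.
recall = true_positives / (true_positives + false_negatives)

print(f"total     = {total}")
print(f"precision = {precision:.4f}")  # 4 / 38
print(f"recall    = {recall:.4f}")     # 4 / 17
```

The rule catches 4 of the 17 fraud cases, and only 4 of the 38 flagged transactions are actually fraud.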