### Context

It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase.

### Content
The dataset contains transactions made by credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features or more background information about the data. Features V1, V2, ... V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. Feature 'Class' is the response variable; it takes value 1 in case of fraud and 0 otherwise.

### Inspiration
Identify fraudulent credit card transactions.

Given the class imbalance ratio, we recommend measuring performance using the Area Under the Precision-Recall Curve (AUPRC). Accuracy computed from the confusion matrix is not meaningful for unbalanced classification.
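To make the AUPRC recommendation concrete, here is a minimal, library-free sketch of average precision, one common summary of the precision-recall curve. The function name `average_precision` and the toy labels and scores are illustrative assumptions, not part of the dataset; in practice scikit-learn's `average_precision_score` is the usual choice.

```python
def average_precision(y_true, scores):
    """Mean of the precision values at each rank where a true positive is found.

    Assumes binary labels (1 = fraud) and no tied scores.
    """
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positive labels")
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos


# Toy example (hypothetical labels and scores)
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
ap = average_precision(y_true, scores)
print(round(ap, 4))
```

Unlike accuracy, this score only rewards ranking the rare positive cases ahead of the negatives, so a model that predicts "not fraud" everywhere cannot score well.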

## Import libraries and data

In [43]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix, recall_score, precision_score, precision_recall_curve
import xgboost as xgb
from xgboost.sklearn import XGBClassifier

import os
os.chdir("/Users/amit.sood/Documents/Analytics/grail")

data = pd.read_csv("creditcard.csv")
print('Shape of the data ')
data.shape

Shape of the data 


(284807, 31)

As stated in the problem statement, this is a highly imbalanced dataset. Let's look at the distribution of the Class variable.

In [120]:
print('Data is highly imbalanced and this needs to be addressed while modeling the data')
data['Class'].value_counts()


Data is highly imbalanced and this needs to be addressed while modeling the data


0    284315
1       492
Name: Class, dtype: int64
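The counts above are enough to work out how severe the imbalance is and why plain accuracy is misleading. The short calculation below is a sketch using only the two printed counts; the variable names are illustrative.

```python
n_legit = 284315   # Class 0 count from value_counts() above
n_fraud = 492      # Class 1 count from value_counts() above
total = n_legit + n_fraud

fraud_rate = n_fraud / total
# Accuracy of a trivial model that labels every transaction as legitimate
baseline_accuracy = n_legit / total
# Ratio of negatives to positives
neg_pos_ratio = n_legit / n_fraud

print(f"fraud rate:        {fraud_rate:.4%}")
print(f"baseline accuracy: {baseline_accuracy:.4%}")
print(f"neg/pos ratio:     {neg_pos_ratio:.1f}")
```

A model that never flags fraud is already right more than 99.8% of the time, which is why recall and precision are used throughout the rest of this notebook.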

In [46]:
print('Quick summary of each column. Nothing major here since almost all the columns are principal components. It is extremely difficult to get anything out of these components.')
print('\nOnly the Time and Amount features can be used for some sort of feature engineering.')
data.describe()


Quick summary of each column. Nothing major here since almost all the columns are principal components. It is extremely difficult to get anything out of these components.

Only the Time and Amount features can be used for some sort of feature engineering.


Statistic,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
count,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,...,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0
mean,94813.859575,3.91956e-15,5.688174e-16,-8.769071e-15,2.782312e-15,-1.552563e-15,2.010663e-15,-1.694249e-15,-1.927028e-16,-3.137024e-15,...,1.537294e-16,7.959909e-16,5.36759e-16,4.458112e-15,1.453003e-15,1.699104e-15,-3.660161e-16,-1.206049e-16,88.349619,0.001727
std,47488.145955,1.958696,1.651309,1.516255,1.415869,1.380247,1.332271,1.237094,1.194353,1.098632,...,0.734524,0.7257016,0.6244603,0.6056471,0.5212781,0.482227,0.4036325,0.3300833,250.120109,0.041527
min,0.0,-56.40751,-72.71573,-48.32559,-5.683171,-113.7433,-26.16051,-43.55724,-73.21672,-13.43407,...,-34.83038,-10.93314,-44.80774,-2.836627,-10.2954,-2.604551,-22.56568,-15.43008,0.0,0.0
25%,54201.5,-0.9203734,-0.5985499,-0.8903648,-0.8486401,-0.6915971,-0.7682956,-0.5540759,-0.2086297,-0.6430976,...,-0.2283949,-0.5423504,-0.1618463,-0.3545861,-0.3171451,-0.3269839,-0.07083953,-0.05295979,5.6,0.0
50%,84692.0,0.0181088,0.06548556,0.1798463,-0.01984653,-0.05433583,-0.2741871,0.04010308,0.02235804,-0.05142873,...,-0.02945017,0.006781943,-0.01119293,0.04097606,0.0165935,-0.05213911,0.001342146,0.01124383,22.0,0.0
75%,139320.5,1.315642,0.8037239,1.027196,0.7433413,0.6119264,0.3985649,0.5704361,0.3273459,0.597139,...,0.1863772,0.5285536,0.1476421,0.4395266,0.3507156,0.2409522,0.09104512,0.07827995,77.165,0.0
max,172792.0,2.45493,22.05773,9.382558,16.87534,34.80167,73.30163,120.5895,20.00721,15.59499,...,27.20284,10.50309,22.52841,4.584549,7.519589,3.517346,31.6122,33.84781,25691.16,1.0
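Since `Time` and `Amount` are the only untransformed columns, the summary above suggests two simple derived features. The sketch below is one possible way to build them; the function names and the choice of the 25% and 75% quartiles as bucket edges are assumptions for illustration, not something the dataset prescribes.

```python
def hour_of_day(seconds_elapsed):
    """Hour within a 24-hour cycle, counted from the first transaction.

    The clock time of the first transaction is unknown, so this is a
    relative hour, not a wall-clock hour.
    """
    return int(seconds_elapsed // 3600) % 24


def amount_bucket(amount, low_cut=5.6, high_cut=77.165):
    """Label an amount as low / medium / high.

    The default cut points are the 25% and 75% quartiles from describe();
    using them as bucket edges is an assumption for illustration.
    """
    if amount <= low_cut:
        return "low"
    if amount <= high_cut:
        return "medium"
    return "high"


print(172792 / 3600)         # span of the data in hours (max of Time)
print(hour_of_day(172792))   # relative hour of the last transaction
print(amount_bucket(22.0))   # bucket of the median amount
```

The maximum `Time` value works out to just under 48 hours, which matches the statement that the data covers two days.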


# Model 1 - Under-sampling of the majority class
Since we already know that the dataset is imbalanced, I am going to try under-sampling the majority class and test the model.

### Approach: 
I am going to take samples of 10,000 majority-class rows and merge each sample with the minority class (492 records). This means training about 28 models on 10,492 records each.
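The arithmetic behind this plan can be checked quickly. The sketch below only uses the figures already quoted (284,807 rows, 492 frauds, 10,000 rows per sample); the variable names are illustrative.

```python
n_total = 284807      # rows in the dataset
n_fraud = 492         # minority-class rows, reused in every sample
sample_size = 10000   # majority-class rows per sample

n_models = n_total // sample_size
rows_per_model = sample_size + n_fraud
fraud_share_sample = n_fraud / rows_per_model
fraud_share_population = n_fraud / n_total

print(n_models, rows_per_model)
print(f"{fraud_share_sample:.2%} vs {fraud_share_population:.2%}")
```

Each sample therefore sees fraud roughly 27 times more often than the full dataset does, which is worth keeping in mind when a model trained this way is later scored on the whole population.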

### Accuracy Metric 
I will use confusion matrices to calculate the recall and precision of each model. Plain accuracy will not work, because the class imbalance keeps it high regardless of how well fraud is detected.


In [47]:
print('Splitting class 0 and class 1 data.')
data_class0 = data.loc[data['Class']==0]
data_class1 = data.loc[data['Class']==1]
print('Shape of class 1 data')
data_class1.shape



Splitting class 0 and class 1 data.
Shape of class 1 data


(492, 31)

In [48]:
# Build the classifier from the imported class; note that this rebinds the name xgb to the model
xgb = XGBClassifier()

sample_size = 10000
# Integer division: range() needs an int in Python 3
numberOfSamples = data.shape[0] // sample_size
oldLimit = 0

for count in range(numberOfSamples):
    print("-------------------------------------------------")
    newLimit = (count + 1) * sample_size
    # Next 10,000 majority-class rows plus all 492 fraud rows
    # (DataFrame.append was removed in pandas 2.0, so use pd.concat)
    underSampleData = pd.concat([data_class0[oldLimit:newLimit], data_class1])
    oldLimit = newLimit

    X = underSampleData.drop('Class', axis=1)
    y = underSampleData['Class']

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Convert values into arrays
    X_train = X_train.values
    X_test = X_test.values
    y_train = y_train.values
    y_test = y_test.values

    # Refit the same estimator on this sample and evaluate on its hold-out split
    xgb.fit(X_train, y_train)
    y_pred = xgb.predict(X_test)

    cm = confusion_matrix(y_test, y_pred)
    print('Confusion Matrix : ')
    print(cm)

    print('Recall : ', recall_score(y_test, y_pred))
    print('Precision : ', precision_score(y_test, y_pred))


-------------------------------------------------
Confusion Matrix : 
[[3006    0]
 [   1  141]]
Recall :  0.9929577464788732
Precision :  1.0
-------------------------------------------------
Confusion Matrix : 
[[3006    0]
 [   5  137]]
Recall :  0.9647887323943662
Precision :  1.0
-------------------------------------------------
Confusion Matrix : 
[[3005    1]
 [   2  140]]
Recall :  0.9859154929577465
Precision :  0.9929078014184397
-------------------------------------------------
Confusion Matrix : 
[[3006    0]
 [   0  142]]
Recall :  1.0
Precision :  1.0
-------------------------------------------------
Confusion Matrix : 
[[3004    2]
 [   1  141]]
Recall :  0.9929577464788732
Precision :  0.986013986013986
-------------------------------------------------
Confusion Matrix : 
[[3006    0]
 [   4  138]]
Recall :  0.971830985915493
Precision :  1.0
-------------------------------------------------
Confusion Matrix : 
[[3005    1]
 [   3  139]]
Recall :  0.9788732394366197
Pre

### Observations
* The model is consistent across all the samples, which suggests the samples are similarly distributed.
* Recall and precision are both very high, which is a good sign.

### Further testing the model
* While the performance of each model is good and consistent, I need to be careful and check that the model is not overfitting and that it generalizes to all the data. To test that, I am going to take the last model and predict on all the data I have, to see whether it generalizes to the population.

In [31]:
X=data.drop('Class', axis=1)
y=data['Class']

# test_size=0.001 leaves 99.9% of the rows in X_train, so this scores almost the whole dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.001, random_state=42)
    
# Convert values into an array
X_train = X_train.values
y_train = y_train.values

y_pred= xgb.predict(X_train)
    
cm = confusion_matrix(y_train, y_pred)    
print(cm)

print(recall_score(y_train, y_pred))
print(precision_score(y_train, y_pred))


[[ 15427 268604]
 [     5    486]]
0.9898167006109979
0.0018060871827269686


### Observations
* Recall is still very high, but precision suffers heavily. This means we would end up classifying a lot of good transactions as fraud, leading to a poor customer experience. Customers would eventually churn, which would be bad for the business.
* The model does not generalize to the population.
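The two scores follow directly from the confusion matrix printed above. The sketch below recomputes them by hand, assuming scikit-learn's layout (rows are the true class, columns the predicted class), and adds the false-positive rate, which is not printed in the notebook.

```python
# Confusion matrix printed above; scikit-learn's layout is
# rows = true class, columns = predicted class
cm = [[15427, 268604],
      [5, 486]]

tn, fp = cm[0]
fn, tp = cm[1]

recall = tp / (tp + fn)                # share of real frauds that were caught
precision = tp / (tp + fp)             # share of fraud alerts that were real
false_positive_rate = fp / (fp + tn)   # share of good transactions flagged

print(recall, precision, false_positive_rate)
```

Almost all legitimate transactions are flagged, so the high recall comes from the model labelling nearly everything as fraud rather than from separating the two classes.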

## Summary of Under-sampling approach

Under-sampling the majority class didn't work well, so I need to try something else.

# Model 2 - Train XGBoost on all the data

In [105]:
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
data = pd.read_csv("creditcard.csv")
# Use the imported class directly; this rebinds the name xgb from the module to the model
xgb = XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=100)

X=data.drop('Class', axis=1)
y=data['Class']
    
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    
# Convert values into an array
X_train = X_train.values
X_test = X_test.values
y_train = y_train.values
y_test = y_test.values
        
# Train
xgb.fit(X_train, y_train)
y_pred= xgb.predict(X_test)
x_train_pred_prob= xgb.predict_proba(X_train)
x_test_pred_prob= xgb.predict_proba(X_test)

    
cm = confusion_matrix(y_test, y_pred)    
print('Confusion Matrix on Test Set : ')
print(cm)

print('Recall : ', recall_score(y_test, y_pred))
print('Precision : ', precision_score(y_test, y_pred))


Confusion Matrix : 
[[85300     7]
 [   23   113]]
Recall :  0.8308823529411765
Precision :  0.9416666666666667


In [106]:
# Optional experiment (disabled): append the predicted class probabilities as extra features
#X_train = np.hstack((X_train, x_train_pred_prob))
#X_test = np.hstack((X_test, x_test_pred_prob))

In [108]:
'''
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
#data = pd.read_csv("creditcard.csv")
xgb = xgb.XGBClassifier(max_depth=3, learning_rate=0.1,n_estimators=100)

# Train
xgb.fit(X_train, y_train)
y_pred= xgb.predict(X_test)

    
cm = confusion_matrix(y_test, y_pred)    
print('Confusion Matrix : ')
print(cm)

print('Recall : '),;print(recall_score(y_test, y_pred))
print('Precision : '),;print(precision_score(y_test, y_pred))
'''

Confusion Matrix : 
[[85300     7]
 [   24   112]]
Recall :  0.8235294117647058
Precision :  0.9411764705882353


In [72]:
X=data.drop('Class', axis=1)
y=data['Class']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.001, random_state=42)
    
# Convert values into an array
X_train = X_train.values
y_train = y_train.values

y_pred= xgb.predict(X_train)


cm = confusion_matrix(y_train, y_pred)    
print('Confusion Matrix on almost all data')
print(cm)

print(recall_score(y_train, y_pred))
print(precision_score(y_train, y_pred))


[[284023      8]
 [    36    455]]
0.9266802443991853
0.9827213822894169


### Observation
* On the test set, recall is 0.83 and precision is 0.94.
* This model does not perform better than the under-sampling models above, but note that it is evaluated on a much bigger chunk of the data. The interesting question is whether it generalizes better than the under-sampling models.

In [73]:
y_pred_prob

array([[9.99945402e-01, 5.46129668e-05],
       [9.99981225e-01, 1.87548412e-05],
       [9.99998987e-01, 1.00437421e-06],
       ...,
       [9.99957681e-01, 4.23438141e-05],
       [9.99999881e-01, 1.05019225e-07],
       [9.99747038e-01, 2.52938014e-04]], dtype=float32)

### Observation
* The model is consistent with the test set.
* This model generalizes better than the under-sampling model.
* Precision is good. Recall can be improved further.

# Model 3 - Custom SMOTE Model

In [109]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import average_precision_score
from sklearn.metrics import precision_recall_curve
from sklearn.utils import resample
from sklearn.metrics import confusion_matrix, recall_score, precision_score

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

import xgboost as xgb
from xgboost.sklearn import XGBClassifier


In [110]:
data = pd.read_csv("creditcard.csv")

data1 = data.loc[data['Class']== 1]
data0 = data.loc[data['Class']== 0]


data1_time = data1['Time']
data1_class = data1['Class']
data1_amount = data1['Amount']


data1= data1.drop(['Time', 'Class', 'Amount'], axis=1)
data0= data0.drop(['Time', 'Amount'], axis=1)

## Adding artificial rows of class 1 only

In [111]:
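# Oversample the minority class: 100 noisy copies of each of the 492 fraud rows.
# Unlike standard SMOTE, which interpolates between minority-class neighbours,
# this jitters every PCA component with standard-normal noise.
new_rows = [data1]
for i in range(len(data1)):
    row = data1.iloc[[i]]
    for _ in range(100):
        new_rows.append(row + np.random.normal(0, 1, 28))

# Concatenate once instead of growing the DataFrame inside the loop
data1 = pd.concat(new_rows)
data1['Class'] = 1
data = pd.concat([data1, data0])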



In [113]:

import xgboost as xgb
# Use the imported class directly; this rebinds the name xgb from the module to the model
xgb = XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=100)
#xgb = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial')
#xgb = RandomForestClassifier(n_estimators=100, max_depth=2,random_state=0)
#xgb = GaussianNB()

X=data.drop('Class', axis=1)
y=data['Class']
    
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
X_train.shape
# Convert values into an array
X_train = X_train.values
X_test = X_test.values
y_train = y_train.values
y_test = y_test.values
        
# Train
xgb.fit(X_train, y_train)
y_pred= xgb.predict(X_test)
    
cm = confusion_matrix(y_test, y_pred)    
print('Confusion Matrix : ')
print(cm)

print('Recall : ', recall_score(y_test, y_pred))
print('Precision : ', precision_score(y_test, y_pred))

Confusion Matrix : 
[[85107   107]
 [  552 14437]]
Recall :  0.9631729935285877
Precision :  0.9926430143014301


### Observations
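* On its own test split, this model scores higher recall and precision than Model 2 (0.96 and 0.99 versus 0.83 and 0.94). Note that this split also contains the synthetic fraud rows.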


Below I bring back the original data, with only the 492 real class 1 rows, and check how well the model performs on it.


In [114]:
data_new = pd.read_csv("creditcard.csv")
data_new1 = data_new

data_new1= data_new1.drop(['Time', 'Amount'], axis=1)
X=data_new1.drop('Class', axis=1)
y=data_new1['Class']


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

X_train = X_train.values
X_test = X_test.values
y_train = y_train.values
y_test = y_test.values

# Score the already-fitted Model 3 classifier on 90% of the original, un-augmented data
y_pred= xgb.predict(X_train)
    
cm = confusion_matrix(y_train, y_pred)    
print('Confusion Matrix : ')
print(cm)

print('Recall : ', recall_score(y_train, y_pred))
print('Precision : ', precision_score(y_train, y_pred))


Confusion Matrix : 
[[255608    272]
 [    74    372]]
Recall :  0.8340807174887892
Precision :  0.577639751552795


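### Observation
* It seems the model was overfitting. A likely contributor is that the synthetic fraud rows were generated before the train/test split, so noisy copies of the same fraud can land on both sides of the split and inflate the test scores.
* Both recall and precision have taken a hit on the original data (0.83 and 0.58).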
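One way to keep synthetic rows out of the evaluation is to split the real fraud rows first and augment only the training portion. The sketch below shows that ordering with plain Python lists so it runs without any libraries; `jitter_rows` and the tiny stand-in data are hypothetical, and the same idea carries over to the DataFrames used in this notebook.

```python
import random


def jitter_rows(rows, copies, sigma=1.0, seed=0):
    """Return `copies` noisy versions of each row (Gaussian noise per component)."""
    rng = random.Random(seed)
    return [
        [value + rng.gauss(0.0, sigma) for value in row]
        for row in rows
        for _ in range(copies)
    ]


# Hypothetical stand-in for the minority class: 10 rows with 3 components each
fraud_rows = [[float(i), float(i) + 0.5, float(i) - 0.5] for i in range(10)]

# 1) Split the real fraud rows first
rng = random.Random(42)
shuffled = fraud_rows[:]
rng.shuffle(shuffled)
test_fraud = shuffled[:3]      # held out, never augmented
train_fraud = shuffled[3:]

# 2) Augment only the training portion
train_fraud_augmented = train_fraud + jitter_rows(train_fraud, copies=100)

print(len(train_fraud_augmented), len(test_fraud))
```

With this ordering, the held-out frauds have no noisy copies in the training set, so the test scores are a fairer estimate of performance on unseen fraud.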

# Summary
* Model 2 still has the best combination of recall and precision on the original data.
* It also generalizes the best.

# What more I could have tried
* Create time-related features, such as time of day.
* Create amount-related features, such as a transaction type (low, medium, high).
* Try an ensemble of a couple of models.
* Use the predicted probabilities and classify with a tuned threshold instead of the default 0.5. Choosing the threshold needs domain knowledge.
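The last idea in the list, moving the decision threshold away from 0.5, can be sketched without any libraries. The helper `precision_recall_at` and the toy labels and probabilities below are hypothetical; with the models in this notebook the probabilities would come from the second column of `predict_proba`.

```python
def precision_recall_at(y_true, fraud_proba, threshold):
    """Precision and recall when predicting fraud for proba >= threshold.

    Returns 0.0 for a score whose denominator is zero (a convention).
    """
    tp = fp = fn = 0
    for label, p in zip(y_true, fraud_proba):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels and fraud probabilities
y_true      = [0, 0, 0, 1, 0, 1, 1, 0]
fraud_proba = [0.02, 0.10, 0.35, 0.40, 0.55, 0.60, 0.90, 0.01]

for threshold in (0.3, 0.5, 0.7):
    p, r = precision_recall_at(y_true, fraud_proba, threshold)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

Lowering the threshold trades precision for recall and raising it does the opposite; which point on that curve is acceptable depends on the relative cost of a missed fraud and a false alert.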


 