<a href="https://colab.research.google.com/github/aghabayli/1st_fraud_payment_detection/blob/master/1st_fraud_payment_detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [69]:
#Connect to the drive folder
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [0]:
#Import libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing, tree, metrics
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, roc_curve, auc
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.cluster import KMeans

In [0]:
#Load dataset
payments = pd.read_csv('gdrive/My Drive/1st_adyen_rides-success-and-fail.csv')

# Data Exploration

First, we clarify the following: which features we have, which of them are relevant, what types they have, and how balanced the data set is.

In [72]:
#Examples from the data set
payments.head()

Unnamed: 0,created,device_name,device_os_version,country,city_id,lat,lng,real_destination_lat,real_destination_lng,user_id,order_id,order_try_id,distance,ride_distance,price,ride_price,price_review_status,price_review_reason,is_successful_payment,name,card_bin,failed_attempts
0,2016-01-23 23:10:07,motorolaXT1562,motorola6.0.1,ee,2.0,58.37822,26.710402,58.363243,26.737696,218,4047728,4054895,773,3017,4.5,4.5,ok,,1,**** 0810,,0
1,2016-05-04 06:01:32,iPhone6,iOS10.3.3,ee,1.0,59.42413,24.646359,59.397548,24.660957,266,5093642,5129745,43,4241,4.4,4.4,ok,,1,**** 9115,,0
2,2016-08-27 16:42:22,HTCHTC 10,HTC7.0,ee,1.0,59.413508,24.743706,59.4485,24.804887,551,6655300,6792534,1654,6347,7.2,7.2,ok,,1,**** 0634,516903.0,0
3,2016-10-25 07:14:27,iPhone6S,iOS10.3.2,ee,1.0,59.419938,24.744795,59.431686,24.720801,798,7874827,8103655,883,2638,3.1,3.1,ok,,1,**** 8730,541747.0,0
4,2016-09-09 12:46:47,"iPhone5,2",iOS9.3.4,ee,1.0,59.471328,24.890557,59.427836,24.77446,944,6879043,7039724,1109,10288,9.0999,9.0999,ok,,1,**** 3503,,0


I tried to decompose the timestamp into parts that can repeat across records: year, month, day, hour, and weekday were extracted as separate columns. After comparing the results on the original and the reconstructed features, I did not see any improvement, so these columns are not included in the final version.

In [0]:
#Construct month from date
#payments = payments.assign(month = payments['created'].apply(lambda x : pd.to_datetime(x).month))

#Construct year from date
#payments = payments.assign(year = payments['created'].apply(lambda x : pd.to_datetime(x).year))

#Construct week number from date
#payments = payments.assign(week = payments['created'].apply(lambda x: pd.to_datetime(x).week))

#Construct hour from date
#payments = payments.assign(hour = payments['created'].apply(lambda x: pd.to_datetime(x).hour))

#Construct weekday from date
#payments = payments.assign(weekday = payments['created'].apply(lambda x: pd.to_datetime(x).weekday()))
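
The calendar fields described above can also be derived in a vectorized way, without a per-row `apply`. The following is a minimal sketch, assuming a `created` column that holds timestamp strings as in the sample rows; the two-row `demo` frame is made up for illustration and is not the real data set.

```python
import pandas as pd

# Hypothetical two-row frame that mimics the 'created' column of the data set
demo = pd.DataFrame({"created": ["2016-01-23 23:10:07", "2016-05-04 06:01:32"]})

# Parse the strings once, then read the calendar parts through the .dt accessor
created = pd.to_datetime(demo["created"])
demo = demo.assign(
    year=created.dt.year,
    month=created.dt.month,
    hour=created.dt.hour,
    weekday=created.dt.weekday,  # Monday=0 ... Sunday=6
)
print(demo)
```

Parsing the whole column once is usually much faster on a few hundred thousand rows than calling `pd.to_datetime` inside a lambda for every row.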


In [73]:
#Information about fields
payments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 304053 entries, 0 to 304052
Data columns (total 22 columns):
created                  304053 non-null object
device_name              304053 non-null object
device_os_version        304053 non-null object
country                  304052 non-null object
city_id                  303734 non-null float64
lat                      304053 non-null float64
lng                      304053 non-null float64
real_destination_lat     304026 non-null float64
real_destination_lng     304026 non-null float64
user_id                  304053 non-null int64
order_id                 304053 non-null int64
order_try_id             304053 non-null int64
distance                 304053 non-null int64
ride_distance            304053 non-null int64
price                    304053 non-null float64
ride_price               304053 non-null float64
price_review_status      304053 non-null object
price_review_reason      1108 non-null object
is_successful_payment    

In [114]:
#Percent of fraudulent transactions (minority class)
print("Percent of fraudulent transactions: ")
payments.is_successful_payment.value_counts()[0]/(payments.is_successful_payment.value_counts()[0]+payments.is_successful_payment.value_counts()[1])

Percent of fraudulent transactions: 


0.25993165665196527

We can see that our data set is imbalanced: about 26 percent of the transactions are fraudulent and 74 percent are normal. Usually, splits such as 40%-60% or 50%-50% are considered balanced. I will first work with the original data, then balance it and try again.

Features with unique values are irrelevant in our case. We constructed the label vector and removed the following columns from the input data: 'created', 'user_id', 'order_id', 'order_try_id', 'name', 'card_bin', 'is_successful_payment'. 'user_id', 'order_id' and 'order_try_id' are unique identifiers. Because fraudulent transactions that reuse the same card were already handled earlier, we also removed 'name' and 'card_bin'.

In [0]:
#Splitting data into labels and features data

y = payments.is_successful_payment

#Delete unique features
X = payments.drop(['created', 'user_id', 'order_id', 'order_try_id', 'name', 'card_bin', 'is_successful_payment'], axis=1)

Then we encode the categorical values as integer labels.

In [0]:
#Encode features that contain string values as integer labels
encode = preprocessing.LabelEncoder()
X['device_name'] = encode.fit_transform(X.device_name)
X['device_os_version'] = encode.fit_transform(X.device_os_version)
X['country'] = encode.fit_transform(X.country.fillna('nan'))
X['price_review_status'] = encode.fit_transform(X.price_review_status)
X['price_review_reason'] = encode.fit_transform(X.price_review_reason.fillna('nan'))
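
The cell above reuses a single `LabelEncoder` object, so after the last `fit_transform` only the mapping for `price_review_reason` is still available. A minimal sketch, assuming the mappings are needed later (for example to decode values or to encode new rows), keeps one fitted encoder per column. The `demo` frame and the `encoders` dictionary are hypothetical names introduced here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical miniature of two string-valued columns
demo = pd.DataFrame({
    "country": ["ee", "lv", None, "ee"],
    "price_review_status": ["ok", "ok", "review", "ok"],
})

encoders = {}  # column name -> fitted LabelEncoder
for col in ["country", "price_review_status"]:
    enc = LabelEncoder()
    # Missing values become the literal string 'nan', as in the cell above
    demo[col] = enc.fit_transform(demo[col].fillna("nan"))
    encoders[col] = enc

# A stored encoder can translate the integer codes back to the original strings
print(encoders["country"].inverse_transform(demo["country"]))
```

Note that the integer codes follow the sorted order of the strings, so they carry no meaningful ordering for the model.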

In [82]:
#Checking for missing values
X.isnull().sum()

device_name               0
device_os_version         0
country                   0
city_id                 319
lat                       0
lng                       0
real_destination_lat     27
real_destination_lng     27
distance                  0
ride_distance             0
price                     0
ride_price                0
price_review_status       0
price_review_reason       0
failed_attempts           0
dtype: int64

There are missing values in city id and in the real destination latitude and longitude. All remaining missing values were filled with zero.

In [0]:
#Fill missing values with zero
X = X.fillna(0)

We split the data set into train and test parts, 70% and 30% respectively.

In [0]:
#Split into train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
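
Because the classes are imbalanced, one option is to pass `stratify` to `train_test_split` so that both parts keep the same class ratio. The split above does not use it; the following is only a sketch on synthetic labels (`X_demo` and `y_demo` are made up), assuming the same 70/30 proportion.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with roughly the class ratio of the data set (26% zeros)
y_demo = np.array([0] * 260 + [1] * 740)
X_demo = np.arange(1000).reshape(-1, 1)

# stratify=y_demo keeps the share of each class the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)
print("share of zeros in train:", (y_tr == 0).mean())
print("share of zeros in test: ", (y_te == 0).mean())
```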

# Training

I decided to start the training process with logistic regression, to be able to see how useful each feature is.

In [85]:
#Using Logit to select features
logit_model=sm.Logit(y_train,X_train)
result=logit_model.fit()
print(result.summary2())

Optimization terminated successfully.
         Current function value: 0.519478
         Iterations 7
                            Results: Logit
Model:              Logit                 Pseudo R-squared: 0.093      
Dependent Variable: is_successful_payment AIC:              221158.3776
Date:               2019-07-07 08:38      BIC:              221312.4018
No. Observations:   212837                Log-Likelihood:   -1.1056e+05
Df Model:           14                    LL-Null:          -1.2189e+05
Df Residuals:       212822                LLR p-value:      0.0000     
Converged:          1.0000                Scale:            1.0000     
No. Iterations:     7.0000                                             
-----------------------------------------------------------------------
                        Coef.  Std.Err.    z     P>|z|   [0.025  0.975]
-----------------------------------------------------------------------
device_name             0.0001   0.0000   1.8364 0.0663 -0.0000

Features with p-values higher than 0.05 are less significant. Device OS version, latitude and longitude, and device name are among them. This may also indicate a need for more feature engineering.

Next, I excluded the less significant features and trained a logistic regression model.

In [117]:
#Delete features with p>0.05
X1 = X.drop(['device_name', 'device_os_version', 'lat', 'lng', 'real_destination_lat', 'real_destination_lng', 'distance'], axis=1)
X_train1, X_test1, y_train1, y_test1 = train_test_split(X1, y, test_size=0.3, random_state=0)

#Logistic regression
logreg = LogisticRegression()

#Train logistic regression model
logreg.fit(X_train1, y_train1)

predicted = logreg.predict(X_test1)

#Accuracy
print("Accuracy for the test data: {}".format(accuracy_score(y_test1, predicted)))

#Calculate confusion matrix
print("Confusion matrix for the test data: \n{}".format(confusion_matrix(y_test1, predicted)))

#F measure
print("F-measure for the test data: {}".format(f1_score(y_test1, predicted)))



Accuracy for the test data: 0.7474017716190142
Confusion matrix for the test data: 
[[ 1653 22118]
 [  923 66522]]
F-measure for the test data: 0.8523817150911362


Both false positives and false negatives are important for the problem: we want to identify more frauds and, at the same time, not bother honest users. I will therefore use F-measure and accuracy to compare the results.

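We can see that logistic regression did not perform well. The number of false positives (22118) is high. Accuracy is 0.75, only slightly above the 0.74 that always predicting the majority class would give on this test set, and F-measure is 0.85.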
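
The reported scores can be recomputed from the confusion matrix printed above. scikit-learn's `confusion_matrix` puts true labels in rows and predicted labels in columns, and `f1_score` treats label 1 (successful payment) as the positive class by default. The sketch below additionally computes the recall for class 0 and the majority-class baseline, which the notebook does not report; `summarize` is a hypothetical helper name.

```python
def summarize(cm):
    """Derive metrics from a 2x2 matrix [[tn, fp], [fn, tp]] (label 1 = positive)."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "f1": 2 * tp / (2 * tp + fp + fn),
        "fraud_recall": tn / (tn + fp),          # share of class 0 rows predicted as 0
        "majority_baseline": (fn + tp) / total,  # accuracy of always predicting 1
    }

# Confusion matrix printed for logistic regression on the test data
logreg_cm = [[1653, 22118], [923, 66522]]
for name, value in summarize(logreg_cm).items():
    print(f"{name}: {value:.4f}")
```

Accuracy and F-measure agree with the printed output. The class 0 recall shows that only about 7% of the failed payments are predicted as such, which accuracy and the F-measure of class 1 alone do not reveal.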

### Decision tree

I decided to move on to the Decision Tree algorithm, which is more stable for imbalanced data sets and is a good fit for the problem.

In [96]:
#Decision tree prediction
clf = tree.DecisionTreeClassifier()

#Apply cross validation
scores = cross_val_score(clf, X_train, y_train, cv=5, scoring='accuracy')
print("Accuracy for the validation data sets: {}".format(scores.mean()))

#Train decision tree model on all train data
clf.fit(X_train, y_train)

#Predict labels of test data
predicted = clf.predict(X_test)

#Accuracy
print("Accuracy for the test data: {}".format(accuracy_score(y_test, predicted)))

#Calculate confusion matrix
print("Confusion matrix for the test data: \n{}".format(confusion_matrix(y_test, predicted)))

#F measure
print("F-measure for the test data: {}".format(f1_score(y_test, predicted)))

Accuracy for the validation data sets: 0.9124494235641395
Accuracy for the test data: 0.9115725311348886
Confusion matrix for the test data: 
[[19901  3870]
 [ 4196 63249]]
F-measure for the test data: 0.9400582622395292


The accuracy (0.91) and F-measure (0.94) are considerably better than for logistic regression. As the decision tree algorithm worked well, as a next step I will try the tree ensemble methods Random Forest and AdaBoost.

### Random forest

In [102]:
#Create a model with 5 trees
rf = RandomForestClassifier(n_estimators = 5, random_state = 0)

#Apply cross validation
scores = cross_val_score(rf, X_train, y_train, cv=5, scoring='accuracy')
print("Accuracy for the validation data sets: {}".format(scores.mean()))

#Train the model
rf.fit(X_train,y_train)

#Predict test data
predicted = rf.predict(X_test)

#Accuracy
print("Accuracy for the test data: {}".format(accuracy_score(y_test, predicted)))

#Calculate confusion matrix
print("Confusion matrix for the test data: \n{}".format(confusion_matrix(y_test, predicted)))

#F measure
print("F-measure for the test data: {}".format(f1_score(y_test, predicted)))

Accuracy for the validation data sets: 0.9440604729671067
Accuracy for the test data: 0.9428609015962112
Confusion matrix for the test data: 
[[19607  4164]
 [ 1048 66397]]
F-measure for the test data: 0.9622335260785763


We can see that the results improved: accuracy is 94% and F-measure is 0.96.

### AdaBoost

In [110]:
#Ada Boost Classifier
ada = AdaBoostClassifier(n_estimators=5, learning_rate=1)

#Apply cross validation
scores = cross_val_score(ada, X_train, y_train, cv=5, scoring='accuracy')
print("Accuracy for the validation data sets: {}".format(scores.mean()))

#Train the model
ada.fit(X_train,y_train)

#Predict test data
predicted = ada.predict(X_test)

#Accuracy
print("Accuracy for the test data: {}".format(accuracy_score(y_test, predicted)))

#Calculate confusion matrix
print("Confusion matrix for the test data: \n{}".format(confusion_matrix(y_test, predicted)))

#F measure
print("F-measure for the test data: {}".format(f1_score(y_test, predicted)))

Accuracy for the validation data sets: 0.946513060761454
Accuracy for the test data: 0.9445163129275566
Confusion matrix for the test data: 
[[18750  5021]
 [   40 67405]]
F-measure for the test data: 0.9638166596363792


# Clustering Problem

As a next method, I will treat the problem as if we were not able to label the data set beforehand. I will look at the problem as an unsupervised one and apply the K-Means method to split the users into 2 groups based on similarity.

In [113]:
#K Means
kmeans = KMeans(n_clusters=2, random_state=0)

#Train
kmeans.fit(X_train)

#Predict
predicted = kmeans.predict(X_test)

#Accuracy
print("Accuracy for the test data: {}".format(accuracy_score(y_test, predicted)))

#Calculate confusion matrix
print("Confusion matrix for the test data: \n{}".format(confusion_matrix(y_test, predicted)))

#F measure
print("F-measure for the test data: {}".format(f1_score(y_test, predicted)))

Accuracy for the test data: 0.6741032275039467
Confusion matrix for the test data: 
[[ 1804 21967]
 [ 7760 59685]]
F-measure for the test data: 0.8006197307792914


As expected, the results of the clustering method were lower than those of classification. The accuracy of the model is 0.67 and the F-measure is 0.80. From the confusion matrix, we see that the number of false positives (fraudulent transactions predicted as honest ones) is high. The reason can be the imbalanced data or a need for more feature engineering.
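
K-Means cluster ids are arbitrary: cluster 0 is not guaranteed to correspond to label 0, so scoring the raw ids against `y_test` depends on which id each group happened to receive. A minimal sketch of one common workaround, assuming the labels are available for evaluation only, maps each cluster to its most frequent true label before scoring. `align_clusters` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np

def align_clusters(y_true, cluster_ids):
    """Relabel each cluster with the most frequent true label inside it."""
    y_true = np.asarray(y_true)
    cluster_ids = np.asarray(cluster_ids)
    aligned = np.empty_like(y_true)
    for cid in np.unique(cluster_ids):
        mask = cluster_ids == cid
        # np.bincount counts occurrences of each non-negative integer label
        aligned[mask] = np.bincount(y_true[mask]).argmax()
    return aligned

# Toy example: the grouping is perfect but the ids are swapped
y_true = np.array([0, 0, 1, 1, 1])
clusters = np.array([1, 1, 0, 0, 0])
print((clusters == y_true).mean())                          # raw ids
print((align_clusters(y_true, clusters) == y_true).mean())  # after alignment
```

For the confusion matrix reported above, both clusters contain more class 1 than class 0 rows (7760 vs 1804 and 59685 vs 21967), so this mapping would label every row as 1 and reproduce the majority-class baseline. This suggests the two clusters do not separate the classes well.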

# Sampling Data


After all the experiments described above, I resampled the data and tried all algorithms with balanced data sets. Two methods were implemented to balance the data set: random oversampling and the Synthetic Minority Over-sampling Technique (SMOTE). None of the methods showed better results with the balanced data sets. As the tree-based methods showed good results with the imbalanced data as well, I decided to report the results from the original data.

In [0]:
from imblearn.over_sampling import SMOTE

#Oversample the minority class with synthetic samples
smote = SMOTE(sampling_strategy='minority')
X_sm, y_sm = smote.fit_resample(X, y)

In [0]:
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=0)
X_ros, y_ros = ros.fit_resample(X, y)
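
The cells above resample the full `X` and `y`. A common practice is to resample only the training split, so that duplicated or synthetic rows do not end up in the test data. The following is a minimal sketch of random oversampling written with NumPy, under that assumption; `oversample_minority` is a hypothetical helper and not the imbalanced-learn API.

```python
import numpy as np

def oversample_minority(X, y, seed=0):
    """Duplicate random minority-class rows until both classes are equal in size."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    labels, counts = np.unique(y, return_counts=True)
    minority = labels[counts.argmin()]
    deficit = counts.max() - counts.min()
    # Sample minority row indices with replacement to fill the gap
    extra = rng.choice(np.flatnonzero(y == minority), size=deficit, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]

# Toy training split: 2 rows of class 0 and 6 rows of class 1
X_train_demo = np.arange(16).reshape(8, 2)
y_train_demo = np.array([0, 0, 1, 1, 1, 1, 1, 1])
X_bal, y_bal = oversample_minority(X_train_demo, y_train_demo)
print(np.bincount(y_bal))
```

With the real data, the same idea would be applied to `X_train` and `y_train` after the split, while `X_test` and `y_test` stay untouched.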

# Conclusion

Random Forest and AdaBoost are the top 2 methods among those tested. In terms of accuracy and F-measure, these methods showed better and more stable results than the others.