Importing the Required Libraries

In [81]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Loading the Data into pandas DataFrames for Easier Manipulation

In the data provided, the credit card details are sensitive and ideally should not be disclosed to anyone. Hence the original data have been converted into 28 parameters (V1 to V28), which we can use to build our model and make predictions. The transaction amounts are in dollars ($).

#The Class column in the dataset tells us whether the transaction is legit or fraudulent. (0 ~ Legit and 1 ~ Fraudulent)

In [82]:
card_data = pd.read_csv('creditcard.csv')
# card_data.info() # Column dtypes and non-null counts in our dataset
# card_data.isnull().sum() # Null count per column; we found no null values in the given data

In [83]:
card_data['Class'].value_counts()

Class
0    284315
1       492
Name: count, dtype: int64

As we can clearly see, the data is highly unbalanced: only 492 of the 284807 transactions are fraudulent. A model trained on the whole dataset would tend to predict legit for almost every transaction. Hence we need to preprocess the data so that the model can be trained properly.
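To make the imbalance concrete, here is a minimal sketch that uses only the two class counts printed above. It computes the accuracy of a trivial rule that labels every transaction as legit; the variable names are illustrative and not part of the original notebook.

```python
# Class counts taken from the value_counts() output above.
n_legit = 284315
n_fraud = 492
n_total = n_legit + n_fraud

# Accuracy of a trivial rule that labels every transaction as legit (class 0).
baseline_accuracy = n_legit / n_total

# Share of the dataset that is actually fraudulent.
fraud_share = n_fraud / n_total

print(f"Always-legit accuracy: {baseline_accuracy:.4%}")
print(f"Share of fraud cases:  {fraud_share:.4%}")
```

The always-legit rule scores about 99.8% accuracy while catching no fraud at all, which is why plain accuracy on the full dataset would be misleading.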

Now separating the legit and fraudulent transactions

In [84]:
legit_transactions = card_data[card_data.Class == 0]
fraud_transactions = card_data[card_data.Class == 1]


In [85]:
legit_transactions.Amount.describe() # Summary statistics of the Amount column for the legit_transactions subset

count    284315.000000
mean         88.291022
std         250.105092
min           0.000000
25%           5.650000
50%          22.000000
75%          77.050000
max       25691.160000
Name: Amount, dtype: float64

In [86]:
fraud_transactions.Amount.describe()

count     492.000000
mean      122.211321
std       256.683288
min         0.000000
25%         1.000000
50%         9.250000
75%       105.890000
max      2125.870000
Name: Amount, dtype: float64

We can see that the mean amount for the fraud transactions (about 122.21) is higher than the mean for legit transactions (about 88.29), although the median fraud amount is lower (9.25 versus 22.00). This is an important insight.

In [87]:
card_data.groupby('Class').mean()

Class,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,94838.202258,0.008258,-0.006271,0.012171,-0.00786,0.005453,0.002419,0.009637,-0.000987,0.004467,...,-0.000644,-0.001235,-2.4e-05,7e-05,0.000182,-7.2e-05,-8.9e-05,-0.000295,-0.000131,88.291022
1,80746.806911,-4.771948,3.623778,-7.033281,4.542029,-3.151225,-1.397737,-5.568731,0.570636,-2.581123,...,0.372319,0.713588,0.014049,-0.040308,-0.10513,0.041449,0.051648,0.170575,0.075667,122.211321


Since we have an unbalanced number of examples in the given dataset, we use under-sampling: we draw a random sample from the larger subset (here class '0') that is the same size as the smaller one. We have 492 fraud samples, so we randomly take 492 legit samples.

In [88]:
legit_transactions_sample = legit_transactions.sample(n=492)

Now concatenating it with the fraud transactions DataFrame to make the balanced dataset that we will later split into training and testing sets.

In [89]:
final_dataset = pd.concat([legit_transactions_sample, fraud_transactions], axis=0)  # axis=0 concatenates row-wise

In [90]:
final_dataset['Class'].value_counts()

Class
0    492
1    492
Name: count, dtype: int64
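The under-sampling and concatenation steps can also be sketched as one reusable helper. This is an illustrative sketch, not the notebook's own code: `undersample_majority` and the toy DataFrame are hypothetical, and a fixed `random_state` is assumed here so that the sample is reproducible between runs.

```python
import numpy as np
import pandas as pd


def undersample_majority(df, label_col="Class", random_state=0):
    """Return a frame in which every class has as many rows as the rarest class."""
    n_min = df[label_col].value_counts().min()
    parts = [
        group.sample(n=n_min, random_state=random_state)
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(parts, axis=0)


# Toy data standing in for creditcard.csv (hypothetical values).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Amount": rng.uniform(0, 100, size=1000),
    "Class": [1] * 10 + [0] * 990,
})

balanced = undersample_majority(toy)
print(balanced["Class"].value_counts())
```

Without a fixed seed, each run of `sample` draws different legit rows, so the means and accuracies reported in a notebook like this one will vary slightly from run to run.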

In [91]:
final_dataset.groupby('Class').mean()

Class,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,94661.924797,0.128783,-0.007243,0.011059,-0.101225,-0.125999,-0.044394,0.123602,-0.067611,-0.041086,...,0.008984,0.011228,0.095861,-0.006189,0.007748,0.016778,0.019476,0.024282,0.019691,101.722073
1,80746.806911,-4.771948,3.623778,-7.033281,4.542029,-3.151225,-1.397737,-5.568731,0.570636,-2.581123,...,0.372319,0.713588,0.014049,-0.040308,-0.10513,0.041449,0.051648,0.170575,0.075667,122.211321


For class 0, the means in the sample are close to those of the full legit set (for example, Time is 94661.92 versus 94838.20, and Amount is 101.72 versus 88.29). This suggests that the random sample is reasonably representative of the original dataset, so we can proceed with the training.
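One rough way to judge whether the sampled Amount mean is "close enough" is to express the difference in standard errors. The sketch below uses only numbers copied from the outputs above; it assumes the 492 legit rows behave like independent random draws and ignores any finite-population correction.

```python
import math

# Values copied from the outputs above (class 0 only).
full_mean_amount = 88.291022     # mean Amount over all 284315 legit rows
full_std_amount = 250.105092     # std of Amount over all legit rows
sample_mean_amount = 101.722073  # mean Amount in the 492-row legit sample
n_sample = 492

# Standard error of a mean computed from n_sample random draws.
standard_error = full_std_amount / math.sqrt(n_sample)

# How many standard errors the sample mean sits from the full mean.
z = (sample_mean_amount - full_mean_amount) / standard_error

print(f"standard error of the sample mean: {standard_error:.2f}")
print(f"difference in standard errors:     {z:.2f}")
```

The sample mean lands a little over one standard error from the full mean, which is within ordinary sampling noise for a sample of this size.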

Now splitting into features and targets.

In [92]:
X = final_dataset.drop(columns='Class')  # Every column except Class is a feature
Y = final_dataset['Class']


Splitting into training and testing data

In [93]:
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.15,stratify=Y,random_state=3)

Model Training
I chose Logistic Regression because this is a binary classification problem: there are only 2 classes, legit and fraudulent transactions.
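As a small illustration of why logistic regression suits a two-class problem, the sketch below fits it on synthetic data and shows that the predicted label is the same as thresholding the estimated class-1 probability at 0.5. The synthetic data from `make_classification` and all `toy_*` names are assumptions made for this example and are not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data standing in for the real features (hypothetical).
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

toy_model = LogisticRegression(max_iter=2000)
toy_model.fit(X_toy, y_toy)

# Column 1 of predict_proba holds the estimated probability of class 1.
prob_class_1 = toy_model.predict_proba(X_toy)[:, 1]

# For two classes, predict() agrees with thresholding that probability at 0.5.
labels_from_threshold = (prob_class_1 > 0.5).astype(int)
labels_from_predict = toy_model.predict(X_toy)

print(np.array_equal(labels_from_threshold, labels_from_predict))
```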


In [94]:
model = LogisticRegression(max_iter=2000)

In [95]:
model.fit(X_train,Y_train)

In [96]:
X_train_predict = model.predict(X_train)
trainingDataAccuracy = accuracy_score(Y_train, X_train_predict)  # accuracy_score expects (y_true, y_pred)


In [97]:
print(trainingDataAccuracy)

0.9389952153110048


In [98]:
X_test_prediction = model.predict(X_test)
testingDataAccuracy = accuracy_score(Y_test, X_test_prediction)  # accuracy_score expects (y_true, y_pred)

In [99]:
print(testingDataAccuracy)

0.9527027027027027


Here we need to make sure that the testing accuracy is in line with the training accuracy. If the testing accuracy is far lower than the training accuracy (by roughly 20-40 percentage points), we say that the model has overfitted to the training data and cannot predict accurately when exposed to new data. If the training accuracy itself is low, we say that the model has underfitted. In our case the testing accuracy (about 95.3%) is slightly higher than the training accuracy (about 93.9%), so there is no sign of overfitting.
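The comparison can be made explicit with a few lines of arithmetic on the two accuracies printed above. The split sizes are an assumption inferred from `test_size=0.15` applied to the 984 balanced rows (148 test rows, 836 training rows); they are not printed anywhere in the notebook.

```python
# Accuracies copied from the printed outputs above.
train_acc = 0.9389952153110048
test_acc = 0.9527027027027027

# Assumed split sizes: 15% of the 984 balanced rows, rounded up, go to the test set.
n_total = 984
n_test = 148
n_train = n_total - n_test

# Positive gap means the model does worse on unseen data than on training data.
gap = train_acc - test_acc

# Number of correctly classified rows implied by each accuracy.
train_correct = round(train_acc * n_train)
test_correct = round(test_acc * n_test)

# How much a single test row moves the testing accuracy.
one_row_effect = 1 / n_test

print(f"train minus test accuracy: {gap:+.4f}")
print(f"correct on train: {train_correct} of {n_train}")
print(f"correct on test:  {test_correct} of {n_test}")
print(f"one test row changes accuracy by {one_row_effect:.4f}")
```

With a test set this small, two misclassified rows account for the whole gap between the two scores, so the slightly higher testing accuracy should not be over-interpreted.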