Importing the Dependencies

In [77]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [78]:
# Loading the dataset into a pandas DataFrame
credit_card_data = pd.read_csv('/creditcard.csv')

In [79]:
# First 5 rows of the dataset
credit_card_data.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0.0
1,0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0.0
2,1,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0.0
3,1,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0.0
4,2,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0.0


In [80]:
# Dataset information
credit_card_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1986 entries, 0 to 1985
Data columns (total 31 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Time    1986 non-null   int64  
 1   V1      1986 non-null   float64
 2   V2      1986 non-null   float64
 3   V3      1986 non-null   float64
 4   V4      1986 non-null   float64
 5   V5      1986 non-null   float64
 6   V6      1986 non-null   float64
 7   V7      1986 non-null   float64
 8   V8      1986 non-null   float64
 9   V9      1986 non-null   float64
 10  V10     1986 non-null   float64
 11  V11     1986 non-null   float64
 12  V12     1986 non-null   float64
 13  V13     1986 non-null   float64
 14  V14     1985 non-null   float64
 15  V15     1985 non-null   float64
 16  V16     1985 non-null   float64
 17  V17     1985 non-null   float64
 18  V18     1985 non-null   float64
 19  V19     1985 non-null   float64
 20  V20     1985 non-null   float64
 21  V21     1985 non-null   float64
 22  

In [81]:
# Checking the number of missing values in each column
credit_card_data.isnull().sum()

Unnamed: 0,0
Time,0
V1,0
V2,0
V3,0
V4,0
V5,0
V6,0
V7,0
V8,0
V9,0
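
The `info()` output above lists 1985 non-null values for V14 through V21 in a frame of 1986 rows, so at least one row appears to be incomplete. A minimal sketch of locating and dropping such rows follows. It uses a small hypothetical frame (`toy`) rather than the real file, and dropping is only one possible way to handle missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for credit_card_data;
# the last row is incomplete, as a truncated CSV line would be.
toy = pd.DataFrame({
    "V13": [0.1, -0.4, 0.7],
    "V14": [1.2, 0.3, np.nan],
    "Class": [0.0, 1.0, np.nan],
})

# Count missing values per column and show only the affected columns.
missing_per_column = toy.isnull().sum()
print(missing_per_column[missing_per_column > 0])

# Drop every row that contains at least one missing value.
cleaned = toy.dropna()
print(cleaned.shape)
```

Filtering the counts with `missing_per_column > 0` keeps the output short when the frame has many columns.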


In [82]:
# Distribution of legitimate and fraudulent transactions
credit_card_data['Class'].value_counts()

Class,count
0.0,1983
1.0,2


This dataset is highly imbalanced: 1983 legitimate transactions against only 2 fraudulent ones.

0 ---> Normal Transaction

1 ---> Fraudulent Transaction
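
To see why this imbalance matters, the sketch below takes the two class counts from the `value_counts()` output above and computes the fraud share, along with the accuracy a trivial classifier would reach by labelling every transaction as legitimate. The variable names are illustrative only.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
class_counts = pd.Series({0.0: 1983, 1.0: 2}, name="count")

# Share of fraudulent transactions among all labelled rows.
fraud_share = class_counts.loc[1.0] / class_counts.sum()

# Accuracy of a classifier that always predicts the majority class.
baseline_accuracy = class_counts.max() / class_counts.sum()

print(f"Fraud share: {fraud_share:.4%}")
print(f"Accuracy of always predicting 'legit': {baseline_accuracy:.4f}")
```

A model has to beat this baseline to be useful, which is why accuracy alone says little on data like this.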

In [83]:
# Separating the data for analysis
legit = credit_card_data[credit_card_data.Class == 0.0]
fraud = credit_card_data[credit_card_data.Class == 1.0]


In [84]:
# Printing Shapes
print(legit.shape)
print(fraud.shape)

(1983, 31)
(2, 31)


In [85]:
# Statistical measures of the data
legit.Amount.describe()

Unnamed: 0,Amount
count,1983.0
mean,68.404892
std,241.572682
min,0.0
25%,4.95
50%,15.09
75%,63.285
max,7712.43


In [86]:
fraud.Amount.describe()

Unnamed: 0,Amount
count,2.0
mean,264.5
std,374.059487
min,0.0
25%,132.25
50%,264.5
75%,396.75
max,529.0


In [87]:
# Comparing the mean values of each column for both classes
credit_card_data.groupby('Class').mean()

Class,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0.0,760.974786,-0.281494,0.2672,0.848906,0.146804,-0.077904,0.051713,0.139533,-0.059771,0.014492,...,0.056618,-0.012217,-0.144666,-0.043548,0.013865,0.108318,0.049441,0.02722,-0.001966,68.404892
1.0,439.0,-2.677884,-0.602658,-0.260694,3.143275,0.418809,-1.245684,-1.105907,0.661932,-1.520521,...,1.114625,0.589464,0.200214,0.455377,0.013198,0.162159,0.016239,0.004186,-0.053756,264.5


Build a sample dataset containing an equal number of normal and fraudulent transactions (under-sampling)

Number of Fraudulent Transactions --> 2
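
A minimal sketch of this under-sampling step on hypothetical toy data. It differs from the cell that follows in two assumed details: the sample size is derived from the number of fraud rows instead of being hard-coded, and a `random_state` seed (an arbitrary value chosen here) makes the sample reproducible.

```python
import pandas as pd

# Hypothetical toy data: 8 legitimate rows (Class 0) and 2 fraudulent rows (Class 1).
toy = pd.DataFrame({
    "Amount": [10.0, 25.5, 3.2, 99.0, 42.0, 7.5, 61.0, 18.3, 0.0, 529.0],
    "Class": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

legit_rows = toy[toy["Class"] == 0]
fraud_rows = toy[toy["Class"] == 1]

# Draw as many legitimate rows as there are fraud rows.
# random_state is an arbitrary seed added for reproducibility.
legit_sample = legit_rows.sample(n=len(fraud_rows), random_state=42)

# Stack the sampled legitimate rows on top of all fraud rows.
balanced = pd.concat([legit_sample, fraud_rows], axis=0)
print(balanced["Class"].value_counts())
```

Without a seed, every run picks different legitimate rows, so downstream results change between runs.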

In [88]:
legit_sample = legit.sample(n=2)

Concatenating two DataFrames

In [89]:
new_dataset = pd.concat([legit_sample, fraud], axis=0)

In [90]:
new_dataset.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
1813,1409,-4.526521,3.50231,-1.092791,-2.850186,-1.059448,-0.394679,-1.657027,-5.94917,2.175552,...,5.895853,-1.668175,0.936892,0.445784,0.31896,-0.183302,0.263425,-0.415279,1.0,0.0
1776,1377,0.920252,-1.194826,-0.655345,-0.782561,1.149542,3.936842,-0.995843,1.007688,1.05453,...,-0.178586,-0.813215,-0.113358,1.086534,0.229358,0.938715,-0.065352,0.038776,187.95,0.0
541,406,-2.312227,1.951992,-1.609851,3.997906,-0.522188,-1.426545,-2.537387,1.391657,-2.770089,...,0.517232,-0.035049,-0.465211,0.320198,0.044519,0.17784,0.261145,-0.143276,0.0,1.0
623,472,-3.043541,-3.157307,1.088463,2.288644,1.359805,-1.064823,0.325574,-0.067794,-0.270953,...,0.661696,0.435477,1.375966,-0.293803,0.279798,-0.145362,-0.252773,0.035764,529.0,1.0


In [91]:
new_dataset.tail()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
1813,1409,-4.526521,3.50231,-1.092791,-2.850186,-1.059448,-0.394679,-1.657027,-5.94917,2.175552,...,5.895853,-1.668175,0.936892,0.445784,0.31896,-0.183302,0.263425,-0.415279,1.0,0.0
1776,1377,0.920252,-1.194826,-0.655345,-0.782561,1.149542,3.936842,-0.995843,1.007688,1.05453,...,-0.178586,-0.813215,-0.113358,1.086534,0.229358,0.938715,-0.065352,0.038776,187.95,0.0
541,406,-2.312227,1.951992,-1.609851,3.997906,-0.522188,-1.426545,-2.537387,1.391657,-2.770089,...,0.517232,-0.035049,-0.465211,0.320198,0.044519,0.17784,0.261145,-0.143276,0.0,1.0
623,472,-3.043541,-3.157307,1.088463,2.288644,1.359805,-1.064823,0.325574,-0.067794,-0.270953,...,0.661696,0.435477,1.375966,-0.293803,0.279798,-0.145362,-0.252773,0.035764,529.0,1.0


In [92]:
new_dataset['Class'].value_counts()

Class,count
0.0,2
1.0,2


In [93]:
new_dataset.groupby('Class').mean()

Class,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0.0,1393.0,-1.803135,1.153742,-0.874068,-1.816373,0.045047,1.771081,-1.326435,-2.470741,1.615041,...,-0.138208,2.858633,-1.240695,0.411767,0.766159,0.274159,0.377706,0.099036,-0.188252,94.475
1.0,439.0,-2.677884,-0.602658,-0.260694,3.143275,0.418809,-1.245684,-1.105907,0.661932,-1.520521,...,1.114625,0.589464,0.200214,0.455377,0.013198,0.162159,0.016239,0.004186,-0.053756,264.5


Splitting the data into Features & Targets

In [94]:
x = new_dataset.drop(columns='Class')
y = new_dataset['Class']

In [95]:
print(x)

      Time        V1        V2        V3        V4        V5        V6  \
1813  1409 -4.526521  3.502310 -1.092791 -2.850186 -1.059448 -0.394679   
1776  1377  0.920252 -1.194826 -0.655345 -0.782561  1.149542  3.936842   
541    406 -2.312227  1.951992 -1.609851  3.997906 -0.522188 -1.426545   
623    472 -3.043541 -3.157307  1.088463  2.288644  1.359805 -1.064823   

            V7        V8        V9  ...       V20       V21       V22  \
1813 -1.657027 -5.949170  2.175552  ... -0.687134  5.895853 -1.668175   
1776 -0.995843  1.007688  1.054530  ...  0.410719 -0.178586 -0.813215   
541  -2.537387  1.391657 -2.770089  ...  0.126911  0.517232 -0.035049   
623   0.325574 -0.067794 -0.270953  ...  2.102339  0.661696  0.435477   

           V23       V24       V25       V26       V27       V28  Amount  
1813  0.936892  0.445784  0.318960 -0.183302  0.263425 -0.415279    1.00  
1776 -0.113358  1.086534  0.229358  0.938715 -0.065352  0.038776  187.95  
541  -0.465211  0.320198  0.044519  0.

In [96]:
print(y)

1813    0.0
1776    0.0
541     1.0
623     1.0
Name: Class, dtype: float64


Splitting the data into Training Data & Testing Data

In [97]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=2, stratify=y, random_state=2)

In [98]:
print(x.shape, x_train.shape, x_test.shape)

(4, 30) (2, 30) (2, 30)
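
In `train_test_split`, an integer `test_size` is an absolute number of samples, while a float is a fraction of the data. The sketch below shows the fractional form together with `stratify` on hypothetical synthetic data. The array names and the 20-sample size are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy data: 10 samples per class, 3 features each.
rng = np.random.default_rng(0)
features = rng.normal(size=(20, 3))
labels = np.array([0] * 10 + [1] * 10)

# test_size=0.2 holds out 20% of the rows; stratify keeps the class
# proportions the same in the training and test parts.
f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=2
)

print(f_train.shape, f_test.shape)
print(np.bincount(l_train), np.bincount(l_test))
```

With stratification, both parts keep the 50/50 class ratio of the full toy set.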


Model Training

In [99]:
model = LogisticRegression()

In [100]:
# Training the logistic Regression Model with Training Data
model.fit(x_train, y_train)

Model Evaluation & Accuracy Score

In [101]:
# Accuracy on training data
x_train_prediction = model.predict(x_train)
training_data_accuracy = accuracy_score(y_train, x_train_prediction)

In [102]:
print('Accuracy on Training Data : ', training_data_accuracy)

Accuracy on Training Data :  1.0


In [103]:
# Accuracy on test data
x_test_prediction = model.predict(x_test)
test_data_accuracy = accuracy_score(y_test, x_test_prediction)

In [104]:
print('Accuracy on Test Data : ', test_data_accuracy)

Accuracy on Test Data :  1.0


In [105]:
# Additional classification metrics for the test data
from sklearn.metrics import precision_score, recall_score, f1_score

test_precision = precision_score(y_test, x_test_prediction)
test_recall = recall_score(y_test, x_test_prediction)
test_f1 = f1_score(y_test, x_test_prediction)

print('Precision on Test Data : ', test_precision)
print('Recall on Test Data : ', test_recall)
print('F1-score on Test Data : ', test_f1)

Precision on Test Data :  1.0
Recall on Test Data :  1.0
F1-score on Test Data :  1.0


Precision, Recall, and F1-score of the Model

In [106]:
print("Precision: Precision measures the accuracy of the positive predictions. In credit card fraud detection, it's the ratio of correctly identified fraudulent transactions (True Positives) to the total number of transactions predicted as fraudulent (True Positives + False Positives). High precision means fewer legitimate transactions are flagged as fraudulent.")
print("Recall: Recall measures the ability of the model to find all the positive cases. In credit card fraud detection, it's the ratio of correctly identified fraudulent transactions (True Positives) to the total number of actual fraudulent transactions (True Positives + False Negatives). High recall means fewer actual fraudulent transactions are missed by the model.")
print("F1-score: The F1-score is the harmonic mean of precision and recall. It provides a single score that balances both precision and recall. A high F1-score indicates that the model has a good balance between correctly identifying fraudulent transactions and not incorrectly flagging legitimate transactions.")

Precision: Precision measures the accuracy of the positive predictions. In credit card fraud detection, it's the ratio of correctly identified fraudulent transactions (True Positives) to the total number of transactions predicted as fraudulent (True Positives + False Positives). High precision means fewer legitimate transactions are flagged as fraudulent.
Recall: Recall measures the ability of the model to find all the positive cases. In credit card fraud detection, it's the ratio of correctly identified fraudulent transactions (True Positives) to the total number of actual fraudulent transactions (True Positives + False Negatives). High recall means fewer actual fraudulent transactions are missed by the model.
F1-score: The F1-score is the harmonic mean of precision and recall. It provides a single score that balances both precision and recall. A high F1-score indicates that the model has a good balance between correctly identifying fraudulent transactions and not incorrectly flagging legitimate transactions.
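
The definitions above can be checked by hand. The sketch below derives the three metrics from confusion-matrix counts for a hypothetical set of labels (not the notebook's data). The results can then be compared with scikit-learn's own scoring functions.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels: 1 = fraudulent, 0 = legitimate.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # correct fraud flags / all fraud flags
recall = tp / (tp + fn)      # correct fraud flags / all actual fraud
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(precision, recall, f1)
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```

Here one fraud case is missed and one legitimate transaction is flagged, so all three scores fall below 1.0.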

## Summary:

### Data Analysis Key Findings

*   The initial model achieved perfect scores (1.0 for precision, recall, and F1-score) on the test data.
*   The perfect metrics are most likely an artifact of the tiny under-sampled dataset (4 transactions in total: 2 for training and 2 for testing) and should not be read as real-world performance.
*   Precision indicates that when the model predicted fraud, it was always correct (no false positives).
*   Recall indicates that the model identified all actual fraudulent transactions (no false negatives).
*   F1-score reflects the perfect balance between precision and recall achieved on this small test set.
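
One way to reduce dependence on a single tiny test split is stratified k-fold cross-validation, which scores the model on several held-out folds. The sketch below is an assumption about how this could look, not part of the original analysis. It uses synthetic data from `make_classification` as a stand-in for a larger sample, and `max_iter=1000` is an arbitrary setting.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, hypothetical data standing in for a larger labelled sample.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Five folds, each preserving the class proportions of the full set.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# One F1-score per fold; the spread hints at how stable the estimate is.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="f1")

print(scores.mean(), scores.std())
```

Reporting the mean together with the standard deviation gives a more honest picture than a single score from two test rows.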

### Insights or Next Steps

*   The model should be evaluated on a significantly larger and more diverse test dataset to obtain a reliable assessment of its performance in a real-world scenario.
*   Given the severe class imbalance observed here (1983 legitimate vs. 2 fraudulent transactions), which is typical of credit card fraud data, techniques like oversampling (e.g., SMOTE) or undersampling should be considered during the training phase to improve the model's ability to detect the minority class (fraudulent transactions).
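
SMOTE is provided by the separate imbalanced-learn package. As a simpler stand-in, the sketch below performs plain random oversampling with `sklearn.utils.resample` on a hypothetical training frame. It duplicates existing minority rows rather than synthesising new ones, so it is not equivalent to SMOTE. To avoid leaking duplicated rows into the test set, it would be applied to the training portion only.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical training frame: 6 legitimate rows and 2 fraudulent rows.
train = pd.DataFrame({
    "Amount": [12.0, 5.5, 80.0, 33.1, 9.9, 150.0, 0.0, 529.0],
    "Class": [0, 0, 0, 0, 0, 0, 1, 1],
})

majority = train[train["Class"] == 0]
minority = train[train["Class"] == 1]

# Draw minority rows with replacement until they match the majority count.
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=0
)

balanced_train = pd.concat([majority, minority_upsampled], axis=0)
print(balanced_train["Class"].value_counts())
```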
