### Credit Card Fraud Detection with Machine Learning

**Dataset Overview:**
The dataset consists of credit card transaction records with **31 columns**: **30 features** plus a **target label**. Each feature describes an aspect of the transaction, while the target label identifies whether the transaction is **fraudulent (1)** or **legitimate (0)**.

* **Time**: The first column indicates the number of seconds passed since the first transaction in the dataset.
* **V1 to V28**: These 28 features are anonymized principal components obtained by applying **Principal Component Analysis (PCA)** to the original attributes. They retain the patterns in transaction behavior that the original fields carried, but cannot be read back as a specific attribute such as spending style, transaction type, or location, which keeps sensitive information secure.
* **Amount**: The second-to-last column holds the **transaction value in USD**.
* **Class**: The final column is the label, where `0` stands for a normal transaction and `1` marks it as fraudulent.

This dataset is widely used to build **machine learning models for fraud detection**. By analyzing these features, models can identify suspicious activity and help in **real-time fraud prevention**.
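The column layout described above can be checked before any modeling. The sketch below is a minimal illustration, not part of the original notebook: `check_schema` and `EXPECTED_COLUMNS` are hypothetical names, and the tiny `demo` frame uses made-up values in place of `creditcard.csv`.

```python
import pandas as pd

# Column layout described above: Time, V1..V28, Amount, Class.
EXPECTED_COLUMNS = ["Time"] + [f"V{i}" for i in range(1, 29)] + ["Amount", "Class"]


def check_schema(df):
    """Return a list of readable problems; an empty list means the layout matches."""
    problems = []
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    if "Class" in df.columns and not set(df["Class"].unique()) <= {0, 1}:
        problems.append("Class contains values other than 0 and 1")
    return problems


# Tiny synthetic frame standing in for creditcard.csv (values are made up).
demo = pd.DataFrame(
    [[0.0] * 29 + [10.0, 0], [1.0] * 29 + [5.0, 1]],
    columns=EXPECTED_COLUMNS,
)

print(len(EXPECTED_COLUMNS))  # 30 features + 1 label
print(check_schema(demo))
```

On the real data, the same call would be `check_schema(credit_card_data)` once the CSV has been loaded.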

---



In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [3]:
credit_card_data = pd.read_csv('creditcard.csv')

In [4]:
credit_card_data.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [5]:
# dataset information
credit_card_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     28

In [7]:
credit_card_data.shape

(284807, 31)

In [9]:
credit_card_data['Class']

0         0
1         0
2         0
3         0
4         0
         ..
284802    0
284803    0
284804    0
284805    0
284806    0
Name: Class, Length: 284807, dtype: int64

In [12]:
credit_card_data['Class'].value_counts()

Class
0    284315
1       492
Name: count, dtype: int64

This dataset is highly imbalanced: only 492 of the 284,807 transactions are fraudulent.

Here,
0 → Normal transaction
1 → Fraudulent transaction
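To put the imbalance in numbers, the short calculation below uses the two class counts printed by `value_counts()` above. The variable names are illustrative. It also shows why raw accuracy is a weak yardstick on the full dataset: a model that labels every transaction as legitimate already scores very high.

```python
# Class counts copied from the value_counts() output above.
n_legit, n_fraud = 284315, 492
n_total = n_legit + n_fraud

# Share of all transactions that are fraudulent.
fraud_rate = n_fraud / n_total

# Accuracy of a trivial model that labels every transaction as legitimate.
majority_baseline_accuracy = n_legit / n_total

print(f"total transactions: {n_total}")
print(f"fraud rate: {fraud_rate:.4%}")
print(f"always-legitimate accuracy: {majority_baseline_accuracy:.4%}")
```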

In cell `In [14]` below, the first line creates a new DataFrame named `legit` by filtering `credit_card_data` to the rows where the `Class` column equals 0, so it contains only legitimate transactions.

The second line creates another DataFrame, `fraud`, from the rows where `Class` equals 1, so it contains only fraudulent transactions.

Separating the dataset this way makes it easy to analyze the two groups independently. Comparing legitimate and fraudulent transactions can reveal patterns or feature values associated with fraud, which helps in building a more effective fraud detection model.

In [13]:
credit_card_data.describe()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
count,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,...,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0
mean,94813.859575,1.175161e-15,3.384974e-16,-1.379537e-15,2.094852e-15,1.021879e-15,1.494498e-15,-5.620335e-16,1.149614e-16,-2.414189e-15,...,1.62862e-16,-3.576577e-16,2.618565e-16,4.473914e-15,5.109395e-16,1.6861e-15,-3.661401e-16,-1.227452e-16,88.349619,0.001727
std,47488.145955,1.958696,1.651309,1.516255,1.415869,1.380247,1.332271,1.237094,1.194353,1.098632,...,0.734524,0.7257016,0.6244603,0.6056471,0.5212781,0.482227,0.4036325,0.3300833,250.120109,0.041527
min,0.0,-56.40751,-72.71573,-48.32559,-5.683171,-113.7433,-26.16051,-43.55724,-73.21672,-13.43407,...,-34.83038,-10.93314,-44.80774,-2.836627,-10.2954,-2.604551,-22.56568,-15.43008,0.0,0.0
25%,54201.5,-0.9203734,-0.5985499,-0.8903648,-0.8486401,-0.6915971,-0.7682956,-0.5540759,-0.2086297,-0.6430976,...,-0.2283949,-0.5423504,-0.1618463,-0.3545861,-0.3171451,-0.3269839,-0.07083953,-0.05295979,5.6,0.0
50%,84692.0,0.0181088,0.06548556,0.1798463,-0.01984653,-0.05433583,-0.2741871,0.04010308,0.02235804,-0.05142873,...,-0.02945017,0.006781943,-0.01119293,0.04097606,0.0165935,-0.05213911,0.001342146,0.01124383,22.0,0.0
75%,139320.5,1.315642,0.8037239,1.027196,0.7433413,0.6119264,0.3985649,0.5704361,0.3273459,0.597139,...,0.1863772,0.5285536,0.1476421,0.4395266,0.3507156,0.2409522,0.09104512,0.07827995,77.165,0.0
max,172792.0,2.45493,22.05773,9.382558,16.87534,34.80167,73.30163,120.5895,20.00721,15.59499,...,27.20284,10.50309,22.52841,4.584549,7.519589,3.517346,31.6122,33.84781,25691.16,1.0


In [14]:
legit = credit_card_data[credit_card_data.Class==0]
fraud = credit_card_data[credit_card_data['Class']==1]

In [15]:
legit.value_counts()

Time      V1          V2          V3         V4         V5         V6         V7         V8         V9         V10         V11        V12        V13        V14        V15        V16        V17        V18        V19        V20        V21        V22        V23        V24        V25        V26        V27        V28        Amount  Class
163152.0  -1.196037    1.585949    2.883976   3.378471   1.511706   3.717077   0.585362  -0.156001   0.122648   4.217934    1.385525  -0.709405  -0.256168  -1.564352   1.693218  -0.785210  -0.228008  -0.412833   0.234834   1.375790  -0.370294   0.524395  -0.355170  -0.869790  -0.133198   0.327804  -0.035702  -0.858197  7.56    0        18
          -1.203617    1.574009    2.889277   3.381404   1.538663   3.698747   0.560211  -0.150911   0.124136   4.220998    1.384569  -0.706897  -0.256274  -1.562583   1.692915  -0.787338  -0.226776  -0.412354   0.234322   1.385597  -0.366727   0.522223  -0.357329  -0.870174  -0.134166   0.327019  -0.042648  -0.855262  1.5

In [16]:
fraud.value_counts()

Time     V1          V2         V3          V4        V5          V6         V7          V8          V9         V10         V11       V12         V13        V14         V15        V16        V17         V18        V19        V20        V21         V22        V23        V24        V25        V26        V27        V28        Amount  Class
68207.0  -13.192671  12.785971  -9.906650   3.320337  -4.801176    5.760059  -18.750889  -37.353443  -0.391540  -5.052502   4.406806  -4.610756   -1.909488  -9.072711   -0.226074  -6.211557  -6.248145   -3.149247   0.051576  -3.493050   27.202839  -8.887017   5.303607  -0.639435   0.263203  -0.108877   1.269566   0.939407  1.00    1        6
94362.0  -26.457745  16.497472  -30.177317  8.904157  -17.892600  -1.227904  -31.197329  -11.438920  -9.462573  -22.187089  4.419997  -10.592305  -0.703796  -3.926207   -2.400246  -6.809890  -12.462315  -5.501051  -0.567940   2.812241  -8.755698    3.460893   0.896538   0.254836  -0.738097  -0.966564  -7.263482  -1.

In [17]:
fraud.shape

(492, 31)

In [18]:
legit.shape

(284315, 31)

In [19]:
fraud['Class']

541       1
623       1
4920      1
6108      1
6329      1
         ..
279863    1
280143    1
280149    1
281144    1
281674    1
Name: Class, Length: 492, dtype: int64

In [20]:
legit.Amount.describe()

count    284315.000000
mean         88.291022
std         250.105092
min           0.000000
25%           5.650000
50%          22.000000
75%          77.050000
max       25691.160000
Name: Amount, dtype: float64


* **count** → total number of valid (non-missing) transaction amounts (284,315).
* **mean** → average transaction amount (≈ 88.29).
* **25%** → first quartile; a quarter of the transactions are ≤ 5.65.
* **50% (median)** → middle value; half the transactions are ≤ 22.00.
* **75%** → third quartile; three quarters of the transactions are ≤ 77.05.
* **max** → highest transaction amount (25,691.16), showing the presence of very large transactions.
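The gap between the mean and the median in the summary above is typical of right-skewed data, where a few very large values pull the mean upward. The sketch below reproduces that effect on a handful of made-up amounts; the values are illustrative only and are not taken from the dataset.

```python
import pandas as pd

# Made-up amounts: mostly small purchases plus one very large one.
amounts = pd.Series([2.0, 5.0, 10.0, 22.0, 30.0, 75.0, 2500.0], name="Amount")

summary = amounts.describe()
print(summary[["count", "mean", "50%", "max"]])

# A mean well above the median signals a right-skewed distribution.
is_right_skewed = summary["mean"] > summary["50%"]
print(is_right_skewed)
```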




In [21]:
fraud.Amount.describe()

count     492.000000
mean      122.211321
std       256.683288
min         0.000000
25%         1.000000
50%         9.250000
75%       105.890000
max      2125.870000
Name: Amount, dtype: float64

In [27]:
# compare the mean feature values of legitimate (0) and fraudulent (1) transactions
credit_card_data.groupby('Class').mean()

Class,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,94838.202258,0.008258,-0.006271,0.012171,-0.00786,0.005453,0.002419,0.009637,-0.000987,0.004467,...,-0.000644,-0.001235,-2.4e-05,7e-05,0.000182,-7.2e-05,-8.9e-05,-0.000295,-0.000131,88.291022
1,80746.806911,-4.771948,3.623778,-7.033281,4.542029,-3.151225,-1.397737,-5.568731,0.570636,-2.581123,...,0.372319,0.713588,0.014049,-0.040308,-0.10513,0.041449,0.051648,0.170575,0.075667,122.211321


To create a sample dataset with a more balanced distribution of legitimate and fraudulent transactions, we need to include the same number of records from each category.

In the original dataset, there are 492 fraudulent transactions. Therefore, to balance the dataset, we can randomly select 492 normal transactions as well.

This process ensures that the new dataset has:

* 492 legitimate transactions
* 492 fraudulent transactions

By having an equal count for both classes, the dataset becomes balanced, making it easier for machine learning models to learn patterns without being biased toward the majority class (normal transactions).

In [38]:
legit_sample = legit.sample(n=492)

`legit.sample(n=492)` randomly selects 492 records from the `legit` DataFrame, so the numbers of legitimate and fraudulent transactions become equal. Because no `random_state` is passed, each run draws a different sample, so the figures below will vary slightly between runs. A balanced dataset keeps the model from being biased toward normal transactions and helps it learn fraud patterns more effectively.
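For a repeatable version of this undersampling step, the draw can be seeded. The sketch below is a minimal illustration on synthetic data: the `toy` frame, its sizes, and the seed value 42 are assumptions made for the example, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: 1,000 legitimate rows and 20 fraud rows.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Amount": rng.exponential(scale=80.0, size=1020),
    "Class": [0] * 1000 + [1] * 20,
})

legit_toy = toy[toy["Class"] == 0]
fraud_toy = toy[toy["Class"] == 1]

# Draw as many legitimate rows as there are fraud rows; random_state makes the draw repeatable.
legit_sample_toy = legit_toy.sample(n=len(fraud_toy), random_state=42)

# Stack both groups, then shuffle so the classes are not stored in contiguous blocks.
balanced = pd.concat([legit_sample_toy, fraud_toy], axis=0).sample(frac=1, random_state=42)

print(balanced["Class"].value_counts().to_dict())
```

Undersampling discards most of the legitimate rows, so it trades information for balance. That is acceptable for a small demonstration like this one.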

In [39]:
credit_card_data = pd.concat([legit_sample,fraud],axis=0)


In [40]:
credit_card_data['Class']

112095    0
201725    0
131905    0
145627    0
137363    0
         ..
279863    1
280143    1
280149    1
281144    1
281674    1
Name: Class, Length: 984, dtype: int64

In [42]:
credit_card_data['Class'].value_counts()

Class
0    492
1    492
Name: count, dtype: int64

In [43]:
credit_card_data.groupby('Class').mean()

Class,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,95155.711382,0.042288,0.00502,-0.058338,0.025189,0.039085,-0.007659,0.05898,0.008876,-0.076436,...,0.036048,-0.01407,-0.034775,0.028646,0.003255,-0.022405,-0.018522,0.003857,-0.008379,95.660102
1,80746.806911,-4.771948,3.623778,-7.033281,4.542029,-3.151225,-1.397737,-5.568731,0.570636,-2.581123,...,0.372319,0.713588,0.014049,-0.040308,-0.10513,0.041449,0.051648,0.170575,0.075667,122.211321


In [44]:
# features: every column except the label
X = credit_card_data.drop(columns='Class')
# target: the Class label (0 = legitimate, 1 = fraudulent)
Y = credit_card_data['Class']

In [45]:
X.shape

(984, 30)

In [46]:
Y.shape

(984,)

In [47]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=2)

### Model Training – Logistic Regression

Logistic Regression is a widely used statistical method for binary classification problems, such as fraud detection, where the outcome has only two possible values:

**0 → Legitimate transaction**

**1 → Fraudulent transaction**

Unlike linear regression, which predicts continuous values, logistic regression predicts the probability that a given transaction belongs to a particular class. It uses a sigmoid (logistic) function to map predictions between 0 and 1.
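As a small illustration of the sigmoid mapping, the sketch below evaluates the logistic function on a few example scores. Here `z` stands for the linear score the model computes from a transaction's features; the specific values are arbitrary.

```python
import numpy as np


def sigmoid(z):
    """Logistic function: maps any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Arbitrary example scores: strongly negative, neutral, strongly positive.
scores = np.array([-4.0, 0.0, 4.0])
probabilities = sigmoid(scores)
print(probabilities)
```

A score of 0 maps to a probability of exactly 0.5. Negative scores fall below 0.5 and positive scores above it.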

**Steps in training:**

* The model learns from the training dataset by finding relationships between input features (e.g., transaction amount, PCA features) and the target label (Class).
* It estimates the probability of fraud for each transaction.
* A threshold (commonly 0.5) is applied:
  * Probability ≥ 0.5 → Fraud (1)
  * Probability < 0.5 → Legitimate (0)
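The thresholding step can be made explicit with `predict_proba`. The sketch below is an illustration on synthetic data from `make_classification`, not on the credit card data. The 0.3 cut-off is a hypothetical alternative, included only to show how lowering the threshold changes the labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data standing in for the transaction features.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Column 1 of predict_proba holds the estimated probability of class 1.
fraud_probability = clf.predict_proba(X_demo)[:, 1]

# Applying the 0.5 cut-off by hand should reproduce predict() for a binary model.
manual_labels = (fraud_probability >= 0.5).astype(int)
print(np.array_equal(manual_labels, clf.predict(X_demo)))

# A lower, hypothetical threshold flags at least as many rows as class 1.
strict_labels = (fraud_probability >= 0.3).astype(int)
print(manual_labels.sum(), strict_labels.sum())
```

In fraud detection a lower threshold usually catches more fraud at the cost of more false alarms, so the cut-off is a design choice and need not stay at 0.5.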

**Why Logistic Regression for Fraud Detection?**

* Simple and fast to implement.
* Works well with imbalanced datasets when combined with balancing techniques (like undersampling or oversampling).
* Provides interpretable results, as feature coefficients show how strongly each feature contributes to predicting fraud.
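To illustrate the interpretability point, the sketch below fits a model on synthetic data and ranks its coefficients by absolute size. The feature names `f0` to `f3` are placeholders. The features are standardized first on the assumption that coefficients are only comparable when the inputs share a scale.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Placeholder feature names for a synthetic four-feature problem.
feature_names = ["f0", "f1", "f2", "f3"]
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=1
)

# Standardize so that coefficient magnitudes are comparable across features.
X_scaled = StandardScaler().fit_transform(X_demo)
clf = LogisticRegression(max_iter=1000).fit(X_scaled, y_demo)

# One coefficient per feature; the sign gives the direction, the magnitude the strength.
coefficients = pd.Series(clf.coef_[0], index=feature_names)
ranked = coefficients.loc[coefficients.abs().sort_values(ascending=False).index]
print(ranked)
```

A positive coefficient pushes the predicted probability toward class 1 as the feature increases, and a negative one pushes it toward class 0.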

In [48]:
model = LogisticRegression()
# training the Logistic Regression Model with Training Data
model.fit(X_train, Y_train)
# accuracy on training data (accuracy_score expects the true labels first)
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)
print('Accuracy on Training data : ', training_data_accuracy)

Accuracy on Training data :  0.9364675984752223


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=100).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
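The warning above recommends scaling the data and raising `max_iter`. One way to do both is a pipeline that standardizes the features before fitting. The sketch below is an illustration on synthetic data, with one column inflated to mimic a large-valued feature such as `Time`. The names `scaled_model` and `iterations_used` and the choice `max_iter=1000` are assumptions for the example, not the notebook's own settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; one column is inflated to mimic a large-valued feature such as Time.
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=0)
X_demo[:, 0] *= 1e5

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=2
)

# The scaler is fitted on the training split only, then reused on the test split.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_tr, y_tr)

# n_iter_ reports how many solver iterations were actually needed.
iterations_used = scaled_model.named_steps["logisticregression"].n_iter_[0]
print("iterations used:", iterations_used)
print("test accuracy:", scaled_model.score(X_te, y_te))
```

Putting the scaler inside the pipeline keeps test-set statistics out of the fitting step, which is why it is preferred over scaling the whole dataset up front.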


In [49]:
# accuracy of testing data
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)
print('Accuracy score on Test Data : ', test_data_accuracy)

Accuracy score on Test Data :  0.934010152284264
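Accuracy alone does not show how many fraudulent transactions were missed. A confusion matrix together with precision and recall gives that breakdown. The sketch below uses ten made-up labels rather than the model's actual predictions, so the numbers only illustrate how the metrics are computed.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Made-up labels for ten transactions (1 = fraud); not results from the model above.
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields true negatives, false positives,
# false negatives, and true positives, in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 4 1 2 3

accuracy = accuracy_score(y_true, y_pred)    # (tp + tn) / total = 7 / 10
precision = precision_score(y_true, y_pred)  # tp / (tp + fp) = 3 / 4
recall = recall_score(y_true, y_pred)        # tp / (tp + fn) = 3 / 5
print(accuracy, precision, recall)
```

With the notebook's variables, the equivalent calls would take `Y_test` and `X_test_prediction` as the first and second arguments.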
