# CREDIT CARD FRAUD DETECTION

**Problem Statement:-**

• Build a machine learning model to identify fraudulent credit card transactions.

• Preprocess and normalize the transaction data, handle class imbalance issues, and split the dataset into training and testing sets.

• Train a classification algorithm, such as logistic regression or random forests, to classify transactions as fraudulent or genuine.

• Evaluate the model's performance using metrics like precision, recall, and F1-score, and consider techniques like oversampling or undersampling for improving results.

In [83]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/creditcardfraud/creditcard.csv


### Data Collection

In [84]:
df = pd.read_csv('/kaggle/input/creditcardfraud/creditcard.csv')

### Data Exploration

In [85]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     28

In [86]:
df.head()

| | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | ... | -0.018307 | 0.277838 | -0.110474 | 0.066928 | 0.128539 | -0.189115 | 0.133558 | -0.021053 | 149.62 | 0 |
| 1 | 0.0 | 1.191857 | 0.266151 | 0.16648 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | ... | -0.225775 | -0.638672 | 0.101288 | -0.339846 | 0.16717 | 0.125895 | -0.008983 | 0.014724 | 2.69 | 0 |
| 2 | 1.0 | -1.358354 | -1.340163 | 1.773209 | 0.37978 | -0.503198 | 1.800499 | 0.791461 | 0.247676 | -1.514654 | ... | 0.247998 | 0.771679 | 0.909412 | -0.689281 | -0.327642 | -0.139097 | -0.055353 | -0.059752 | 378.66 | 0 |
| 3 | 1.0 | -0.966272 | -0.185226 | 1.792993 | -0.863291 | -0.010309 | 1.247203 | 0.237609 | 0.377436 | -1.387024 | ... | -0.1083 | 0.005274 | -0.190321 | -1.175575 | 0.647376 | -0.221929 | 0.062723 | 0.061458 | 123.5 | 0 |
| 4 | 2.0 | -1.158233 | 0.877737 | 1.548718 | 0.403034 | -0.407193 | 0.095921 | 0.592941 | -0.270533 | 0.817739 | ... | -0.009431 | 0.798278 | -0.137458 | 0.141267 | -0.20601 | 0.502292 | 0.219422 | 0.215153 | 69.99 | 0 |


In [87]:
df.describe()

| | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 284807.0 | 284807.0 | 284807.0 | 284807.0 | 284807.0 | 284807.0 | 284807.0 | 284807.0 | 284807.0 | 284807.0 | ... | 284807.0 | 284807.0 | 284807.0 | 284807.0 | 284807.0 | 284807.0 | 284807.0 | 284807.0 | 284807.0 | 284807.0 |
| mean | 94813.859575 | 1.168375e-15 | 3.416908e-16 | -1.379537e-15 | 2.074095e-15 | 9.604066e-16 | 1.487313e-15 | -5.556467e-16 | 1.213481e-16 | -2.406331e-15 | ... | 1.654067e-16 | -3.568593e-16 | 2.578648e-16 | 4.473266e-15 | 5.340915e-16 | 1.683437e-15 | -3.660091e-16 | -1.22739e-16 | 88.349619 | 0.001727 |
| std | 47488.145955 | 1.958696 | 1.651309 | 1.516255 | 1.415869 | 1.380247 | 1.332271 | 1.237094 | 1.194353 | 1.098632 | ... | 0.734524 | 0.7257016 | 0.6244603 | 0.6056471 | 0.5212781 | 0.482227 | 0.4036325 | 0.3300833 | 250.120109 | 0.041527 |
| min | 0.0 | -56.40751 | -72.71573 | -48.32559 | -5.683171 | -113.7433 | -26.16051 | -43.55724 | -73.21672 | -13.43407 | ... | -34.83038 | -10.93314 | -44.80774 | -2.836627 | -10.2954 | -2.604551 | -22.56568 | -15.43008 | 0.0 | 0.0 |
| 25% | 54201.5 | -0.9203734 | -0.5985499 | -0.8903648 | -0.8486401 | -0.6915971 | -0.7682956 | -0.5540759 | -0.2086297 | -0.6430976 | ... | -0.2283949 | -0.5423504 | -0.1618463 | -0.3545861 | -0.3171451 | -0.3269839 | -0.07083953 | -0.05295979 | 5.6 | 0.0 |
| 50% | 84692.0 | 0.0181088 | 0.06548556 | 0.1798463 | -0.01984653 | -0.05433583 | -0.2741871 | 0.04010308 | 0.02235804 | -0.05142873 | ... | -0.02945017 | 0.006781943 | -0.01119293 | 0.04097606 | 0.0165935 | -0.05213911 | 0.001342146 | 0.01124383 | 22.0 | 0.0 |
| 75% | 139320.5 | 1.315642 | 0.8037239 | 1.027196 | 0.7433413 | 0.6119264 | 0.3985649 | 0.5704361 | 0.3273459 | 0.597139 | ... | 0.1863772 | 0.5285536 | 0.1476421 | 0.4395266 | 0.3507156 | 0.2409522 | 0.09104512 | 0.07827995 | 77.165 | 0.0 |
| max | 172792.0 | 2.45493 | 22.05773 | 9.382558 | 16.87534 | 34.80167 | 73.30163 | 120.5895 | 20.00721 | 15.59499 | ... | 27.20284 | 10.50309 | 22.52841 | 4.584549 | 7.519589 | 3.517346 | 31.6122 | 33.84781 | 25691.16 | 1.0 |


In [88]:
print(df['Class'].value_counts(normalize=True))

0    0.998273
1    0.001727
Name: Class, dtype: float64
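
With only about 0.17% of transactions labeled as fraud, plain accuracy is a weak yardstick here. As a quick worked check, the sketch below uses only the two proportions printed above (the variable names are illustrative) to show what a model that always predicts "genuine" would score.

```python
# Class proportions copied from the value_counts(normalize=True) output above
genuine_share = 0.998273
fraud_share = 0.001727

# A trivial "always genuine" classifier is right on every genuine row...
baseline_accuracy = genuine_share
# ...and catches none of the fraud, so its recall on the fraud class is zero
baseline_fraud_recall = 0.0

# Approximate number of genuine transactions per fraudulent one
imbalance_ratio = genuine_share / fraud_share

print(f"Baseline accuracy: {baseline_accuracy:.4%}")
print(f"Baseline fraud recall: {baseline_fraud_recall:.0%}")
print(f"Genuine-to-fraud ratio: about {imbalance_ratio:.0f} to 1")
```

This is why the evaluation later in the notebook looks at precision, recall, and F1-score rather than accuracy alone.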


### Data Preprocessing

In [89]:
from sklearn.preprocessing import StandardScaler  # standardization (zero mean, unit variance)

scaler = StandardScaler()  # initialize the standard scaler

# Standardize every feature column; the 'Class' label is left untouched
feature_cols = df.columns.difference(['Class'])
df[feature_cols] = scaler.fit_transform(df[feature_cols])
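
For reference, `StandardScaler` rescales each column to zero mean and unit variance via z = (x - mean) / std, using the population standard deviation. The sketch below reproduces that arithmetic by hand on the five `Amount` values shown by `df.head()`. The variable names are illustrative, and the real scaler is fitted on all 284,807 rows, so these numbers differ from what the notebook computes.

```python
from statistics import mean, pstdev

# The five Amount values from the df.head() output (illustration only)
amounts = [149.62, 2.69, 378.66, 123.5, 69.99]

mu = mean(amounts)       # column mean
sigma = pstdev(amounts)  # population standard deviation (divides by n)

# z-score each value: subtract the mean, divide by the standard deviation
scaled = [(a - mu) / sigma for a in amounts]

print("mean:", mu)
print("std:", sigma)
print("scaled:", scaled)
```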

### Handling Class Imbalance

In [90]:
from imblearn.over_sampling import SMOTE

In [91]:
X = df.drop('Class', axis=1)  # feature matrix
y = df['Class']               # target variable

In [92]:
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)  # oversample the minority (fraud) class with SMOTE
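
Conceptually, SMOTE creates each synthetic minority sample by interpolating between a real minority sample and one of its nearest minority-class neighbours. The sketch below illustrates only that interpolation step on two made-up 2-D points. It is not imblearn's implementation, and `interpolate` is a hypothetical helper.

```python
import random


def interpolate(sample, neighbour, rng):
    """Return a point on the line segment between sample and neighbour."""
    gap = rng.random()  # random fraction in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbour)]


rng = random.Random(42)
minority_sample = [1.0, 2.0]     # made-up minority-class point
minority_neighbour = [3.0, 6.0]  # made-up nearby minority-class point

synthetic = interpolate(minority_sample, minority_neighbour, rng)
print(synthetic)
```

Every synthetic point falls between two existing fraud samples in feature space, so SMOTE adds plausible variations rather than exact duplicates.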

### Data Splitting

In [93]:
from sklearn.model_selection import train_test_split

# Split the resampled data into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)

### Model Selection & Training

In [None]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=42)  # random forest classifier
model.fit(X_train, y_train)  # train on the training set


### Model Evaluation

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

In [None]:
y_pred = model.predict(X_test)  # predict labels for the test set

In [None]:
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy (%):", accuracy * 100)  # as a percentage
print("Accuracy:", accuracy)            # as a fraction

precision = precision_score(y_test, y_pred)
print("Precision (%):", precision * 100)
print("Precision:", precision)

recall = recall_score(y_test, y_pred)
print("Recall (%):", recall * 100)
print("Recall:", recall)

f1 = f1_score(y_test, y_pred)
print("F1-score (%):", f1 * 100)
print("F1-score:", f1)

The model reaches an accuracy of 99.989% and an F1-score of 99.989% on the test split. A precision of 99.978% means that almost every transaction the model flags as fraudulent really belongs to the positive (fraud) class, and a recall of 100% means that every positive instance in the test split was identified. Note that these figures are measured on a test split drawn from the SMOTE-resampled data, not on the original class distribution, in which only about 0.17% of transactions are fraudulent.

In [None]:
class_report = classification_report(y_test, y_pred)
print("Classification Report:\n", class_report) # classification report

The classification report shows precision, recall, and F1-score values of 1.00 for both classes (0 and 1), and the macro and weighted averages are also 1.00. The report rounds to two decimal places, so these values indicate near-perfect rather than exactly perfect agreement with the ground-truth labels, consistent with the 99.978% precision reported above.
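
`classification_report` prints two decimal places by default, which can make a near-perfect score look exactly perfect. Its `digits` parameter raises the precision of the printout. The sketch below demonstrates this on made-up labels with a single false positive; `y_true_demo` and `y_pred_demo` are illustrative and unrelated to the notebook's data.

```python
from sklearn.metrics import classification_report

# Toy labels: 2000 samples, with one genuine (0) sample misclassified as fraud (1)
y_true_demo = [0] * 1000 + [1] * 1000
y_pred_demo = [1] + [0] * 999 + [1] * 1000

# Default formatting rounds every score to two decimals
report_default = classification_report(y_true_demo, y_pred_demo)
print(report_default)

# digits=5 reveals the small gap hidden by the rounding
report_precise = classification_report(y_true_demo, y_pred_demo, digits=5)
print(report_precise)
```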

In [None]:
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)  # confusion matrix

True Positive (TP): 56976

True Negative (TN): 56738

False Positive (FP): 12

False Negative (FN): 0
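
The reported scores can be cross-checked from the four counts above using the standard metric definitions. This is plain arithmetic on those counts, not scikit-learn output. For orientation, scikit-learn's `confusion_matrix` lays a binary matrix out as `[[TN, FP], [FN, TP]]`.

```python
# Counts copied from the confusion matrix summary above
tp, tn, fp, fn = 56976, 56738, 12, 0

total = tp + tn + fp + fn

accuracy = (tp + tn) / total   # share of all predictions that are correct
precision = tp / (tp + fp)     # share of predicted fraud that is truly fraud
recall = tp / (tp + fn)        # share of true fraud that was caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"Test samples: {total}")
print(f"Accuracy:  {accuracy:.6f}")
print(f"Precision: {precision:.6f}")
print(f"Recall:    {recall:.6f}")
print(f"F1-score:  {f1:.6f}")
```

The 12 false positives are the only errors, which is why recall is exactly 1 while precision and accuracy fall just short of it.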