# What is credit card fraud?

Credit card fraud is the use of someone else's credit card to make purchases or obtain cash advances without the cardholder's knowledge or approval. Fraudsters may still obtain the card itself through physical theft, but they increasingly use digital methods to acquire the card number and the associated personal information needed to make illegal transactions.

Identity theft and credit card fraud overlap considerably. In fact, credit card fraud is one of the most prevalent types of identity theft. In these cases, a fraudster uses the victim's personal information, frequently taken in a cyberattack or data breach, to open a new account that the victim is unaware of. This conduct counts as credit card fraud, identity fraud, and credit fraud alike.

**Types of credit card fraud**

Credit card fraud falls into two basic categories:

1. Card-present fraud

2. Card-not-present fraud

# What is credit card fraud detection?

The term "credit card fraud detection" refers to the collection of procedures, methods, instruments, and regulations used by financial institutions and credit card issuers to thwart identity theft and halt fraudulent transactions.  

With the explosion of data and the rise in credit card transactions in recent years, credit card fraud detection has mostly moved online and become automated. The majority of contemporary systems use machine learning (ML) and artificial intelligence (AI) to handle decision-making, data analysis, predictive modeling, fraud alerts, and the remedial actions taken after specific cases of credit card fraud are found.


**Anomaly detection**

The practice of collecting vast volumes of data from internal and external sources in order to create a framework of "normal" activity for each individual user and identify consistent trends in their behavior is known as anomaly detection.

Data used to create the user profile includes:

* Purchase history and other historical data
* Location
* Device ID
* IP address
* Payment amount
* Transaction information
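
As a rough illustration of a per-user baseline, the sketch below flags a payment amount that deviates strongly from a single user's own history by using a z-score. The function name `is_anomalous_amount`, the threshold of 3 standard deviations, and the sample history are illustrative assumptions, not part of any particular fraud system; a real profile would combine many more signals, such as location, device ID, and IP address.

```python
import statistics


def is_anomalous_amount(history, new_amount, threshold=3.0):
    """Return True if new_amount lies more than `threshold` standard
    deviations from the mean of the user's past payment amounts."""
    if len(history) < 2:
        # Not enough history to estimate the spread; make no judgment.
        return False
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        # All past amounts are identical; anything different is unusual.
        return new_amount != mean
    z_score = abs(new_amount - mean) / stdev
    return z_score > threshold


# Hypothetical purchase history (payment amounts) for one user.
past_amounts = [12.5, 30.0, 22.4, 18.9, 27.3, 15.0]

typical_flag = is_anomalous_amount(past_amounts, 25.0)
unusual_flag = is_anomalous_amount(past_amounts, 950.0)
print(typical_flag, unusual_flag)
```

A single-feature z-score is only a starting point: it assumes amounts are roughly symmetric around the mean, which often does not hold for payment data.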

**Predictive modeling**

In addition to identifying anomalies within a given user account, ML models and predictive analytics can be used to detect fraud patterns or to hint at an ongoing, complex fraud scheme. The capacity to predict future outcomes is crucial because fraudsters are always refining their tactics to avoid being discovered by current tools and approaches.

**Outlier models**

Finally, outlier models are also a feature of several anomaly detection programs. An outlier model, as its name suggests, detects unusual behavior when there is insufficient data to make predictions about patterns.
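
The text does not name a specific outlier algorithm. As one possible example, the sketch below uses scikit-learn's `IsolationForest`, an unsupervised method that needs no labeled fraud cases. The synthetic data, the `contamination` value, and the variable names are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic stand-in for transaction features: 200 ordinary points
# clustered around the origin, plus 2 points far away from the cluster.
ordinary_points = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
extreme_points = np.array([[8.0, 8.0], [-9.0, 9.0]])
points = np.vstack([ordinary_points, extreme_points])

# contamination is the assumed share of outliers in the data.
outlier_model = IsolationForest(contamination=0.02, random_state=0)
outlier_labels = outlier_model.fit_predict(points)  # 1 = inlier, -1 = outlier

print(outlier_labels[-2:])
```

`fit_predict` returns `-1` for points the model considers outliers and `1` for the rest, so the flagged rows can be selected with `points[outlier_labels == -1]`.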

**Import Libraries**

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [None]:
# Loading the dataset into a pandas DataFrame
cradit_card_data = pd.read_csv('/content/creditcard.csv')

In [None]:
cradit_card_data.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0.0
1,0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0.0
2,1,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0.0
3,1,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0.0
4,2,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0.0


In [None]:
cradit_card_data.tail()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
77333,57003,-1.565835,0.537575,3.284121,3.229021,-0.917761,2.016339,-1.157749,1.086392,0.234172,...,0.125678,1.07283,-0.229837,0.094444,0.215087,0.544487,0.271867,-0.089124,25.69,0.0
77334,57005,-0.710264,-0.09532,2.899716,0.718612,-0.501955,0.968641,-0.007123,0.308006,1.383339,...,0.069589,0.711129,-0.03857,0.08899,-0.282553,-0.448201,0.05445,-0.051693,65.0,0.0
77335,57005,0.875729,-0.658494,-0.798643,-0.889801,-0.205406,-1.093946,0.743501,-0.381269,0.405087,...,0.192299,0.161282,-0.451218,0.060376,0.885705,-0.477421,-0.036297,0.035704,235.53,0.0
77336,57006,-0.679923,1.074176,1.045563,1.10062,-0.764069,-1.048969,0.601586,0.283135,-0.67482,...,0.256539,0.475028,0.124473,0.886947,-0.327076,-0.362904,0.017048,0.10904,73.52,0.0
77337,57006,1.380239,-1.328341,1.488601,-1.259442,-2.253642,-0.19398,-1.799733,0.139147,-1.245753,...,,,,,,,,,,


In [None]:
# Dataset Information
cradit_card_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77338 entries, 0 to 77337
Data columns (total 31 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Time    77338 non-null  int64  
 1   V1      77338 non-null  float64
 2   V2      77338 non-null  float64
 3   V3      77338 non-null  float64
 4   V4      77338 non-null  float64
 5   V5      77338 non-null  float64
 6   V6      77338 non-null  float64
 7   V7      77338 non-null  float64
 8   V8      77338 non-null  float64
 9   V9      77338 non-null  float64
 10  V10     77338 non-null  float64
 11  V11     77338 non-null  float64
 12  V12     77338 non-null  float64
 13  V13     77338 non-null  float64
 14  V14     77337 non-null  float64
 15  V15     77337 non-null  float64
 16  V16     77337 non-null  float64
 17  V17     77337 non-null  float64
 18  V18     77337 non-null  float64
 19  V19     77337 non-null  float64
 20  V20     77337 non-null  float64
 21  V21     77337 non-null  float64
 22

In [None]:
# Checking the number of missing values in each column
cradit_card_data.isnull().sum()

Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       1
V15       1
V16       1
V17       1
V18       1
V19       1
V20       1
V21       1
V22       1
V23       1
V24       1
V25       1
V26       1
V27       1
V28       1
Amount    1
Class     1
dtype: int64

In [None]:
# Drop rows that contain null values

cradit_card_data.dropna(inplace=True)

In [None]:
cradit_card_data.isnull().sum()

Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64

In [None]:
# Distribution of legitimate and fraudulent transactions
cradit_card_data['Class'].value_counts()

Class
0.0    77149
1.0      188
Name: count, dtype: int64

This dataset is highly imbalanced.

0.0 --> Normal transaction

1.0 --> Fraudulent transaction
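
To make the imbalance concrete, the short calculation below uses the class counts from the `value_counts()` output above. It also shows why plain accuracy is misleading on the full dataset: a trivial model that labels every transaction as legitimate would already score very highly. The variable names are illustrative.

```python
# Class counts taken from the value_counts() output above.
n_legit = 77149
n_fraud = 188
n_total = n_legit + n_fraud

# Share of transactions that are fraudulent.
fraud_share = n_fraud / n_total

# Accuracy of a trivial model that labels every transaction as legitimate.
majority_baseline_accuracy = n_legit / n_total

print(f"Fraud share: {fraud_share:.4%}")
print(f"Always-legitimate baseline accuracy: {majority_baseline_accuracy:.4%}")
```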

In [None]:
# Separating the data for analysis

legit = cradit_card_data[cradit_card_data.Class == 0]
fraud = cradit_card_data[cradit_card_data.Class == 1]

In [None]:
print(legit.shape)
print(fraud.shape)

(77149, 31)
(188, 31)


In [None]:
# Statistical measure of the data
legit.Amount.describe()

count    77149.000000
mean        97.625867
std        270.623024
min          0.000000
25%          7.690000
50%         26.800000
75%         89.000000
max      19656.530000
Name: Amount, dtype: float64

In [None]:
fraud.Amount.describe()

count     188.000000
mean       94.292500
std       214.093799
min         0.000000
25%         1.000000
50%         7.550000
75%        99.990000
max      1809.680000
Name: Amount, dtype: float64

In [None]:
# Compare the mean feature values of the two transaction classes
cradit_card_data.groupby('Class').mean()

Class,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0.0,36643.938353,-0.239252,-0.043518,0.701351,0.152168,-0.264608,0.10159,-0.097359,0.046639,0.005268,...,0.042235,-0.030677,-0.105246,-0.038271,0.007363,0.134344,0.025494,0.000766,0.002836,97.625867
1.0,32055.739362,-6.692335,4.711159,-8.903156,5.250821,-4.90527,-2.04085,-7.040342,3.194879,-3.137542,...,0.374793,0.797237,-0.171993,-0.229811,-0.084322,0.242355,0.097236,0.586565,0.050794,94.2925


**Under Sampling**

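Build a sample containing a similar number of legitimate and fraudulent transactions.

Number of fraudulent transactions --> 188 (the sample drawn here contains 235 legitimate transactions, so the two classes are close in size but not equal)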

In [None]:
legit_sample = legit.sample(n=235)
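
The call above draws a different random sample on every run. A minimal sketch of a repeatable, exactly balanced variant is shown below on a toy DataFrame: it passes `random_state` to `sample` and sets `n` to the number of fraudulent rows. The toy data, the seed value 42, and the `toy_*` names are assumptions for illustration and are not the notebook's own variables.

```python
import pandas as pd

# Toy stand-in for the real data: 20 legitimate rows and 4 fraudulent rows.
toy = pd.DataFrame({
    "Amount": range(24),
    "Class": [0] * 20 + [1] * 4,
})
toy_legit = toy[toy["Class"] == 0]
toy_fraud = toy[toy["Class"] == 1]

# Draw as many legitimate rows as there are fraudulent ones; fixing
# random_state makes the draw repeatable between runs.
toy_legit_sample = toy_legit.sample(n=len(toy_fraud), random_state=42)

balanced = pd.concat([toy_legit_sample, toy_fraud], axis=0)
print(balanced["Class"].value_counts())
```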

**Concatenating two DataFrames**

In [None]:
df = pd.concat([legit_sample, fraud], axis=0)

In [None]:
df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
38687,39497,0.638671,-0.75366,-1.010706,1.576892,1.825976,3.992803,-0.293181,0.871099,-0.706463,...,0.283239,-0.033091,-0.391112,1.025202,0.554873,0.109766,-0.049818,0.072898,310.26,0.0
18831,29804,1.300808,-0.488333,-0.207323,-0.533079,-0.448373,-0.278322,-0.479955,-0.04541,-0.624055,...,-0.060991,-0.203066,-0.089616,-0.530904,0.467647,-0.242943,0.030805,0.037067,55.3,0.0
58437,48388,1.190007,0.33011,0.418664,0.718089,-0.324112,-0.784032,0.031123,-0.098646,-0.026927,...,-0.235927,-0.637634,0.196293,0.366092,0.107213,0.103167,-0.004759,0.03122,1.98,0.0
18652,29666,-0.47079,1.092801,0.827531,0.164187,1.068948,0.426464,0.658549,-0.163283,-0.215004,...,-0.304076,-0.603212,-0.179898,-1.352417,-0.470777,0.244773,-0.045642,-0.011807,2.28,0.0
20709,31225,1.347733,-0.680297,-0.148854,-0.972619,-0.450982,0.062023,-0.5104,-0.017737,-1.037347,...,-0.492909,-1.091472,-0.078668,-0.979541,0.230715,0.963584,-0.068981,-0.003582,56.94,0.0


In [None]:
df.tail()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
75511,56098,-1.229669,1.956099,-0.851198,2.796987,-1.913977,-0.044934,-1.340739,-0.555548,-1.184468,...,1.208054,0.277612,0.019266,0.508529,-0.201183,-0.2496,0.562239,0.075309,170.92,1.0
76555,56624,-7.901421,2.720472,-7.885936,6.348334,-5.480119,-0.333059,-8.682376,1.164431,-4.542447,...,0.077739,1.092437,0.320133,-0.434643,-0.380687,0.21363,0.42362,-0.105169,153.46,1.0
76609,56650,-8.762083,2.79103,-7.682767,6.991214,-5.230695,-0.357388,-9.685621,1.749335,-4.495679,...,-0.090527,0.34859,0.051132,-0.41543,0.219665,0.33002,-0.028252,-0.15627,7.52,1.0
76929,56806,0.016828,2.400826,-4.22036,3.462217,-0.624142,-1.294303,-2.986028,0.751883,-1.606672,...,0.285832,-0.771508,-0.2652,-0.873077,0.939776,-0.219085,0.874494,0.470434,1.0,1.0
77099,56887,-0.075483,1.812355,-2.566981,4.127549,-1.628532,-0.805895,-3.390135,1.019353,-2.451251,...,0.794372,0.270471,-0.143624,0.013566,0.634203,0.213693,0.773625,0.387434,5.0,1.0


In [None]:
df['Class'].value_counts()

Class
0.0    235
1.0    188
Name: count, dtype: int64

In [None]:
df.groupby('Class').mean()

Class,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0.0,36825.485106,-0.298875,0.007493,0.6471,0.112804,-0.381484,0.140746,-0.095909,0.004292,-0.046838,...,-0.017711,0.047913,-0.106496,0.004067,-0.003975,0.139553,0.062809,0.031523,-0.007768,105.595191
1.0,32055.739362,-6.692335,4.711159,-8.903156,5.250821,-4.90527,-2.04085,-7.040342,3.194879,-3.137542,...,0.374793,0.797237,-0.171993,-0.229811,-0.084322,0.242355,0.097236,0.586565,0.050794,94.2925


**Split the Data into features & targets**

In [None]:
X = df.drop(columns='Class')
Y = df['Class']

In [None]:
print(X)
print(Y)

        Time        V1        V2        V3        V4        V5        V6  \
38687  39497  0.638671 -0.753660 -1.010706  1.576892  1.825976  3.992803   
18831  29804  1.300808 -0.488333 -0.207323 -0.533079 -0.448373 -0.278322   
58437  48388  1.190007  0.330110  0.418664  0.718089 -0.324112 -0.784032   
18652  29666 -0.470790  1.092801  0.827531  0.164187  1.068948  0.426464   
20709  31225  1.347733 -0.680297 -0.148854 -0.972619 -0.450982  0.062023   
...      ...       ...       ...       ...       ...       ...       ...   
75511  56098 -1.229669  1.956099 -0.851198  2.796987 -1.913977 -0.044934   
76555  56624 -7.901421  2.720472 -7.885936  6.348334 -5.480119 -0.333059   
76609  56650 -8.762083  2.791030 -7.682767  6.991214 -5.230695 -0.357388   
76929  56806  0.016828  2.400826 -4.220360  3.462217 -0.624142 -1.294303   
77099  56887 -0.075483  1.812355 -2.566981  4.127549 -1.628532 -0.805895   

             V7        V8        V9  ...       V20       V21       V22  \
38687 -0.2931

**Split the Data into Training data & Test data**

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=2)
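
The `stratify=Y` argument asks `train_test_split` to keep the class proportions the same in the training and test sets. The sketch below shows that effect on toy data; the arrays and the 60/40 class mix are assumptions chosen so that the proportions are easy to check.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples with a 60/40 class mix.
features = np.arange(100).reshape(-1, 1)
labels = np.array([0] * 60 + [1] * 40)

f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=2
)

# With stratification, both splits keep the same share of class 1.
print(l_train.mean(), l_test.mean())
```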

In [None]:
print(X.shape, X_train.shape, X_test.shape)
print(Y.shape, Y_train.shape, Y_test.shape)

(423, 30) (338, 30) (85, 30)
(423,) (338,) (85,)


**Model Training**

Logistic regression is applied here.

In [None]:
model = LogisticRegression()

In [None]:
# Training the logistic regression model with the training data
model.fit(X_train, Y_train)
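
In this dataset, `Time` and `Amount` take much larger values than the `V` columns, and features on very different scales can slow the convergence of the logistic regression solver. One common option, not used in this notebook, is to standardize the features inside a pipeline. The sketch below shows the pattern on synthetic data; the data, the `max_iter` value, and the `*_demo` names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples = 200
y_demo = np.array([0] * (n_samples // 2) + [1] * (n_samples // 2))

# Synthetic stand-in: one large-scale feature (similar in range to Time)
# that carries no signal, and one small-scale feature that separates
# the two classes.
large_scale = rng.uniform(0, 60000, size=n_samples)
small_scale = rng.normal(loc=y_demo * 3.0, scale=1.0)
X_demo = np.column_stack([large_scale, small_scale])

# The scaler is fitted together with the model, so the same scaling is
# applied automatically whenever the pipeline predicts.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_demo, y_demo)

demo_accuracy = scaled_model.score(X_demo, y_demo)
print(demo_accuracy)
```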

**Model Evaluation**


Accuracy Score

In [None]:
# Accuracy on training data (accuracy_score expects the true labels first)
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)

In [None]:
print(f'Accuracy on Training Data:{training_data_accuracy}')

Accuracy on Training Data:0.9349112426035503


In [None]:
# Accuracy on test data (accuracy_score expects the true labels first)
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)

print(f'Accuracy on Test Data:{test_data_accuracy}')

Accuracy on Test Data:0.9411764705882353



In simple terms, these accuracy results mean that for this credit card fraud detection model:

* On the training data (data used to fit the model), it classified about **93.5%** of transactions correctly.
* On the test data (data held out from training), it classified about **94.1%** of transactions correctly.

Because the two scores are close, the model does not appear to be overfitting. Accuracy counts correct predictions for both classes, legitimate and fraudulent, and both scores were measured on the undersampled data (235 legitimate and 188 fraudulent transactions), not on the original, highly imbalanced dataset.
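
Accuracy alone does not show how many fraudulent transactions were caught or how many legitimate ones were wrongly flagged. A confusion matrix, precision, and recall give that breakdown. The sketch below applies them to small hand-written label lists; `y_true` and `y_pred` are made-up values for illustration, not results from the model above.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 1 = fraudulent, 0 = legitimate.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# For binary labels, ravel() yields the counts in the order
# true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Precision: share of flagged transactions that really were fraud.
precision = precision_score(y_true, y_pred)
# Recall: share of actual fraud cases that were flagged.
recall = recall_score(y_true, y_pred)

print(tn, fp, fn, tp)
print(precision, recall)
```

In the notebook, the same calls could be made with `Y_test` and `X_test_prediction` in place of the made-up lists.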




