# **Credit Card Fraud Detection using Logistic Regression**

In this Colab notebook, we'll explore a Credit Card Fraud Detection problem using a Logistic Regression model. Credit card fraud detection is a critical task in the financial industry, where the goal is to identify fraudulent transactions among a large number of legitimate ones.

We will perform the following steps in this notebook:

1. **Data Loading and Exploration:** We'll begin by importing necessary libraries, loading the credit card transaction dataset, and exploring the dataset to understand its structure and characteristics.

2. **Data Preprocessing:** We'll check for missing values, examine the distribution of legitimate and fraudulent transactions, and perform under-sampling to create a balanced dataset for modeling.

3. **Feature and Target Separation:** We'll separate the data into features (X) and the target variable (Y).

4. **Data Splitting:** We'll split the dataset into training and testing sets to evaluate our model's performance.

5. **Model Training:** We'll create a Logistic Regression model and train it using the training data.

6. **Model Evaluation:** We'll evaluate the model's performance using accuracy scores on both the training and testing datasets.

## Dataset
https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud

This notebook provides a step-by-step guide to building a fraud detection model and evaluating its effectiveness. Let's get started!


Importing Dependencies

In [None]:
# Importing the necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Reading, Importing and Modifying Data

In [None]:
# Load the dataset into a Pandas DataFrame
credit_card_data = pd.read_csv('/content/credit_data.csv')

In [None]:
# Display the first 5 rows of the dataset
credit_card_data.head()

In [None]:
# Display the last 5 rows of the dataset
credit_card_data.tail()

In [None]:
# Provide information about the dataset, including data types and non-null counts
credit_card_data.info()

In [None]:
# Check the number of missing values in each column
credit_card_data.isnull().sum()
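
If the check above were to report missing values, one simple option is to drop the affected rows. The sketch below uses a small, made-up DataFrame (`toy` is hypothetical and only stands in for `credit_card_data`); whether dropping or imputing is appropriate depends on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for credit_card_data
toy = pd.DataFrame({
    "Amount": [10.0, np.nan, 25.5, 3.2],
    "Class": [0, 0, 1, 0],
})

# Count missing values per column, as in the cell above
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Drop every row that contains at least one missing value
toy_clean = toy.dropna()
print(toy_clean.shape)
```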

In [None]:
# Count the distribution of legitimate transactions (0) and fraudulent transactions (1)
credit_card_data['Class'].value_counts()
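
Raw counts can be easier to interpret as proportions. The following sketch, on made-up labels (the 995/5 split is an assumption chosen for illustration, not the real dataset), shows `value_counts(normalize=True)` and the accuracy a trivial "always predict legitimate" rule would reach on such data.

```python
import pandas as pd

# Hypothetical labels: 995 legitimate (0) and 5 fraudulent (1) transactions
labels = pd.Series([0] * 995 + [1] * 5, name="Class")

# Share of each class instead of raw counts
class_share = labels.value_counts(normalize=True)
print(class_share)

# Always predicting the majority class is right exactly as often as
# the majority class occurs, so its accuracy equals the largest share
baseline_accuracy = class_share.max()
print("Majority-class baseline accuracy:", baseline_accuracy)
```

On heavily imbalanced labels this baseline is already very high, which is one reason the notebook balances the classes before training.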

In [None]:
# Separate the data into legitimate and fraudulent transactions
# 0 --> Normal Transaction
# 1 --> Fraudulent Transaction
legit = credit_card_data[credit_card_data.Class == 0]
fraud = credit_card_data[credit_card_data.Class == 1]

In [None]:
# Display the shape (number of rows and columns) of the legitimate and fraudulent data
print(legit.shape)
print(fraud.shape)

In [None]:
# Calculate statistical measures for the amount in legitimate transactions
legit.Amount.describe()

In [None]:
# Calculate statistical measures for the amount in fraudulent transactions
fraud.Amount.describe()

In [None]:
# Compare the mean values for both legitimate and fraudulent transactions
credit_card_data.groupby('Class').mean()

This dataset is highly imbalanced: only 492 of the transactions are fraudulent, so a model trained on the raw data would see almost nothing but legitimate examples.

To address this, we perform under-sampling: we build a sample dataset containing an equal number of legitimate and fraudulent transactions.
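
The idea can be sketched on synthetic data before applying it to the real dataset. The helper `undersample`, the toy DataFrame, and the seed values below are assumptions made for illustration; they are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical imbalanced data: 1000 legitimate rows, 20 fraudulent rows
rng = np.random.default_rng(0)
n_legit, n_fraud = 1000, 20
toy = pd.DataFrame({
    "Amount": rng.exponential(scale=50.0, size=n_legit + n_fraud),
    "Class": [0] * n_legit + [1] * n_fraud,
})

def undersample(df, label_col="Class", seed=1):
    """Randomly keep as many rows per class as the smallest class has."""
    n_minority = df[label_col].value_counts().min()
    parts = [
        group.sample(n=n_minority, random_state=seed)
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(parts, axis=0)

balanced = undersample(toy)
print(balanced["Class"].value_counts())
```

Under-sampling discards most of the majority class, so it trades information for balance; that trade-off is acceptable for a simple walkthrough like this one.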

# Model Initiation

### Preparing the Data for the Model

In [None]:
# Sample as many legitimate transactions as there are fraudulent ones (492)
# random_state makes the sample reproducible between runs
legit_sample = legit.sample(n=len(fraud), random_state=1)

Concatenating two DataFrames

In [None]:
# Stack the sampled legitimate transactions and the fraudulent ones into one balanced dataset
new_dataset = pd.concat([legit_sample, fraud], axis=0)

In [None]:
# Display the first 5 rows of the new dataset
new_dataset.head()

In [None]:
# Display the last 5 rows of the new dataset
new_dataset.tail()

In [None]:
# Count the distribution of legitimate (0) and fraudulent (1) transactions in the new dataset
new_dataset['Class'].value_counts()

In [None]:
# Calculate and display the mean values for both classes in the new dataset
new_dataset.groupby('Class').mean()

Splitting the data into Features (X) and Targets (Y)

In [None]:
# Separate the data into features (X) and the target variable (Y)
# `columns=` already identifies the axis, so `axis=1` is not needed
X = new_dataset.drop(columns='Class')
Y = new_dataset['Class']

In [None]:
# Print the features (X) and the target variable (Y)
print(X)
print(Y)

Split the data into Training Data and Testing Data

In [None]:
# Split the data into training and testing data with an 80/20 ratio
# stratify=Y keeps the class proportions the same in both splits
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=2)

In [None]:
# Display the shapes of the datasets
print(X.shape, X_train.shape, X_test.shape)
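
To see what `stratify` does, the sketch below splits a made-up, perfectly balanced dataset (`X_toy` and `y_toy` are hypothetical) and checks the share of the positive class in each split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced data: 50 rows of each class
X_toy = pd.DataFrame({"feature": np.arange(100, dtype=float)})
y_toy = pd.Series([0] * 50 + [1] * 50, name="Class")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=2
)

# With stratification, both splits keep the original class proportions
train_fraud_share = y_tr.mean()
test_fraud_share = y_te.mean()
print("Train share of class 1:", train_fraud_share)
print("Test share of class 1: ", test_fraud_share)
```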

Model Training using Logistic Regression

In [None]:
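# Create a Logistic Regression model
# max_iter is raised from the default of 100 because the solver may not converge on unscaled features
model = LogisticRegression(max_iter=1000)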

In [None]:
# Train the Logistic Regression model using the training data
model.fit(X_train, Y_train)
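
Logistic regression solvers tend to behave better when features share a similar scale, which is not the case for a raw column such as `Amount`. As an optional variation, and not what the original notebook does, the sketch below standardizes the features inside a scikit-learn pipeline. The data is synthetic and every name in the block is an assumption made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data with two features on very different scales
rng = np.random.default_rng(0)
n = 200
small = rng.normal(size=n)
large = rng.normal(loc=50_000, scale=20_000, size=n)
X_demo = np.column_stack([small, large])
# Label depends on both features once they are put on a common scale
y_demo = (small + (large - 50_000) / 20_000 > 0).astype(int)

# The scaler is fitted together with the model, so the same
# transformation is applied automatically at prediction time
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_demo, y_demo)

demo_accuracy = scaled_model.score(X_demo, y_demo)
print("Accuracy on the synthetic data:", demo_accuracy)
```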

Model Evaluation using Accuracy Score

In [None]:
# Calculate and display the accuracy on the training data
# accuracy_score expects the true labels first, then the predictions
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)
print('Accuracy on Training Data: ', training_data_accuracy)

In [None]:
# Calculate and display the accuracy on the test data
# accuracy_score expects the true labels first, then the predictions
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)
print('Accuracy on Test Data: ', test_data_accuracy)
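
Accuracy treats both kinds of error the same, while in fraud detection a missed fraud and a false alarm usually carry different costs. As a possible complement to the accuracy scores above, the sketch below computes a confusion matrix, precision, and recall with scikit-learn. The labels are made up for illustration; in the notebook they would be `Y_test` and `X_test_prediction`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels and predictions (1 = fraudulent)
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Precision: share of flagged transactions that really are fraud
precision = precision_score(y_true, y_pred)
# Recall: share of actual fraud that the model caught
recall = recall_score(y_true, y_pred)
print("Precision:", precision)
print("Recall:   ", recall)
```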