<a href="https://colab.research.google.com/github/git-shashank-hp/Structured-ML-Credit-Card-Fraud-Detection-Project/blob/main/Structured_ML_Credit_Card_Fraud_Detection_Project_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Problem Statement

In the banking industry, detecting credit card fraud using machine learning is not just a trend; it is a necessity for banks, as they need to implement proactive monitoring and fraud prevention mechanisms. Machine learning helps these institutions reduce time-consuming manual reviews, costly chargebacks and fees, and the denial of legitimate transactions.

Suppose you are part of the analytics team working on a fraud detection model and its cost-benefit analysis. You need to develop a machine learning model that detects fraudulent transactions from the historical transactional data of customers with a pool of merchants. You can learn more about transactional data and the creation of historical variables from the link attached here. This will be helpful for the capstone project when building the fraud detection model. Based on your understanding of the model, you must analyze the business impact of fraudulent transactions and recommend the best ways for the bank to mitigate fraud risk.

---

## Understanding and Defining Fraud

Credit card fraud is any dishonest act or behavior to obtain information without the proper authorization of the account holder for financial gain. Among the different ways of committing fraud, skimming is the most common one. Skimming is a method used for duplicating information located on the magnetic stripe of the card. Apart from this, other ways of making fraudulent transactions include:

- Manipulation or alteration of genuine cards
- Creation of counterfeit cards
- Stolen or lost credit cards
- Fraudulent telemarketing

## Data Understanding

This is a simulated data set taken from the Kaggle website and contains both legitimate and fraudulent transactions. You can download the dataset using this [link](#).

The dataset contains credit card transactions of around 1,000 cardholders with a pool of 800 merchants from January 1, 2019, to December 31, 2020. It contains a total of **1,852,394 transactions**, of which **9,651 are fraudulent**. The positive class (frauds) accounts for only **0.52%** of all transactions, so the dataset is highly imbalanced, and this imbalance needs to be addressed before model building.
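
The quoted fraud share can be checked with simple arithmetic, and the same check can be run on a label column with `value_counts`. The snippet below is a minimal sketch: the `toy` frame is a made-up stand-in for the real data, not part of the dataset.

```python
import pandas as pd

# Counts quoted in the dataset description.
total_transactions = 1_852_394
fraud_transactions = 9_651
fraud_rate = fraud_transactions / total_transactions
print(f"Fraud rate: {fraud_rate:.2%}")

# The same check on a DataFrame, using a toy label column (1 fraud in 200 rows).
toy = pd.DataFrame({"is_fraud": [0] * 199 + [1]})
class_counts = toy["is_fraud"].value_counts()                # absolute counts per class
class_share = toy["is_fraud"].value_counts(normalize=True)   # fraction per class
print(class_counts)
print(class_share)
```

On the real data, the same two `value_counts` calls on `df['is_fraud']` would report the class counts and shares directly.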

The features in the dataset include:

- `amt`: Represents the transaction amount.
- `is_fraud`: A binary label where 1 indicates a fraudulent transaction and 0 indicates a legitimate transaction.

---


In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


Mounted at /content/gdrive


# Project Pipeline




## Step 1 - Load Dataset

### Understanding Data

In this step, I load the data and examine the features present in it. This helps in choosing the features needed for the final model.


In [2]:
file_path = '/content/gdrive/MyDrive/Colab Notebooks/CCDP/fraudTrain.csv'

In [3]:
df = pd.read_csv(file_path)

In [4]:
# Top 5 rows of data

df.head(5)



| | Unnamed: 0 | trans_date_trans_time | cc_num | merchant | category | amt | first | last | gender | street | ... | lat | long | city_pop | job | dob | trans_num | unix_time | merch_lat | merch_long | is_fraud |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2019-01-01 00:00:18 | 2703186189652095 | fraud_Rippin, Kub and Mann | misc_net | 4.97 | Jennifer | Banks | F | 561 Perry Cove | ... | 36.0788 | -81.1781 | 3495 | Psychologist, counselling | 1988-03-09 | 0b242abb623afc578575680df30655b9 | 1325376018 | 36.011293 | -82.048315 | 0 |
| 1 | 1 | 2019-01-01 00:00:44 | 630423337322 | fraud_Heller, Gutmann and Zieme | grocery_pos | 107.23 | Stephanie | Gill | F | 43039 Riley Greens Suite 393 | ... | 48.8878 | -118.2105 | 149 | Special educational needs teacher | 1978-06-21 | 1f76529f8574734946361c461b024d99 | 1325376044 | 49.159047 | -118.186462 | 0 |
| 2 | 2 | 2019-01-01 00:00:51 | 38859492057661 | fraud_Lind-Buckridge | entertainment | 220.11 | Edward | Sanchez | M | 594 White Dale Suite 530 | ... | 42.1808 | -112.262 | 4154 | Nature conservation officer | 1962-01-19 | a1a22d70485983eac12b5b88dad1cf95 | 1325376051 | 43.150704 | -112.154481 | 0 |
| 3 | 3 | 2019-01-01 00:01:16 | 3534093764340240 | fraud_Kutch, Hermiston and Farrell | gas_transport | 45.0 | Jeremy | White | M | 9443 Cynthia Court Apt. 038 | ... | 46.2306 | -112.1138 | 1939 | Patent attorney | 1967-01-12 | 6b849c168bdad6f867558c3793159a81 | 1325376076 | 47.034331 | -112.561071 | 0 |
| 4 | 4 | 2019-01-01 00:03:06 | 375534208663984 | fraud_Keeling-Crist | misc_pos | 41.96 | Tyler | Garcia | M | 408 Bradley Rest | ... | 38.4207 | -79.4629 | 99 | Dance movement psychotherapist | 1986-03-28 | a41d7549acf90789359a9aa5346dcb46 | 1325376186 | 38.674999 | -78.632459 | 0 |


In [9]:
# View all columns present in the dataset

df.columns

Index(['Unnamed: 0', 'trans_date_trans_time', 'cc_num', 'merchant', 'category',
       'amt', 'first', 'last', 'gender', 'street', 'city', 'state', 'zip',
       'lat', 'long', 'city_pop', 'job', 'dob', 'trans_num', 'unix_time',
       'merch_lat', 'merch_long', 'is_fraud'],
      dtype='object')
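
The `Unnamed: 0` column in the listing above is the row index that was saved into the CSV. As a hedged sketch (not the loading code used in this notebook), `read_csv` can absorb that column with `index_col=0` and parse the date columns at load time with `parse_dates`. The snippet reads a small in-memory sample restricted to a few columns, so `csv_text`, `raw`, and `sample` are illustrative names only.

```python
import io
import pandas as pd

# Two rows shaped like the file, restricted to a few columns for brevity.
# The empty first header is how a saved row index appears in a CSV.
csv_text = """,trans_date_trans_time,amt,dob,is_fraud
0,2019-01-01 00:00:18,4.97,1988-03-09,0
1,2019-01-01 00:00:44,107.23,1978-06-21,0
"""

# Default load: the unnamed first column comes back as 'Unnamed: 0'.
raw = pd.read_csv(io.StringIO(csv_text))

# index_col=0 treats that column as the row index instead;
# parse_dates converts the listed columns to datetime64.
sample = pd.read_csv(
    io.StringIO(csv_text),
    index_col=0,
    parse_dates=["trans_date_trans_time", "dob"],
)

print(raw.columns.tolist())
print(sample.dtypes)
```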

## About this File

This dataset contains information about credit card transactions. The primary goal is to build a model to predict fraudulent transactions based on various features. Below are the descriptions of the features in the dataset:

### Features:

- **index** (`Unnamed: 0`): Unique identifier for each row.
- **trans_date_trans_time**: Date and time of the transaction.
- **cc_num**: Credit card number of the customer.
- **merchant**: Name of the merchant where the transaction occurred.
- **category**: Category of the merchant (e.g., `grocery_pos`, `gas_transport`, `entertainment`).
- **amt**: Amount of the transaction.
- **first**: First name of the credit card holder.
- **last**: Last name of the credit card holder.
- **gender**: Gender of the credit card holder.
- **street**: Street address of the credit card holder.
- **city**: City of the credit card holder.
- **state**: State of the credit card holder.
- **zip**: ZIP code of the credit card holder.
- **lat**: Latitude location of the credit card holder.
- **long**: Longitude location of the credit card holder.
- **city_pop**: Population of the credit card holder's city.
- **job**: Job of the credit card holder.
- **dob**: Date of birth of the credit card holder.
- **trans_num**: Unique transaction number.
- **unix_time**: UNIX timestamp of the transaction.
- **merch_lat**: Latitude location of the merchant.
- **merch_long**: Longitude location of the merchant.
- **is_fraud**: Fraud flag indicating whether the transaction is fraudulent (Target Class).  
  - **1**: Fraudulent transaction.
  - **0**: Legitimate transaction.

---

In [11]:
# Inspect the dataset: column dtypes and non-null counts

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1296675 entries, 0 to 1296674
Data columns (total 23 columns):
 #   Column                 Non-Null Count    Dtype  
---  ------                 --------------    -----  
 0   Unnamed: 0             1296675 non-null  int64  
 1   trans_date_trans_time  1296675 non-null  object 
 2   cc_num                 1296675 non-null  int64  
 3   merchant               1296675 non-null  object 
 4   category               1296675 non-null  object 
 5   amt                    1296675 non-null  float64
 6   first                  1296675 non-null  object 
 7   last                   1296675 non-null  object 
 8   gender                 1296675 non-null  object 
 9   street                 1296675 non-null  object 
 10  city                   1296675 non-null  object 
 11  state                  1296675 non-null  object 
 12  zip                    1296675 non-null  int64  
 13  lat                    1296675 non-null  float64
 14  long              

In [14]:
# Descriptive analysis

df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Unnamed: 0 | 1296675.0 | 648337.0 | 374318.0 | 0.0 | 324168.5 | 648337.0 | 972505.5 | 1296674.0 |
| cc_num | 1296675.0 | 4.17192e+17 | 1.308806e+18 | 60416210000.0 | 180042900000000.0 | 3521417000000000.0 | 4642255000000000.0 | 4.992346e+18 |
| amt | 1296675.0 | 70.35104 | 160.316 | 1.0 | 9.65 | 47.52 | 83.14 | 28948.9 |
| zip | 1296675.0 | 48800.67 | 26893.22 | 1257.0 | 26237.0 | 48174.0 | 72042.0 | 99783.0 |
| lat | 1296675.0 | 38.53762 | 5.075808 | 20.0271 | 34.6205 | 39.3543 | 41.9404 | 66.6933 |
| long | 1296675.0 | -90.22634 | 13.75908 | -165.6723 | -96.798 | -87.4769 | -80.158 | -67.9503 |
| city_pop | 1296675.0 | 88824.44 | 301956.4 | 23.0 | 743.0 | 2456.0 | 20328.0 | 2906700.0 |
| unix_time | 1296675.0 | 1349244000.0 | 12841280.0 | 1325376000.0 | 1338751000.0 | 1349250000.0 | 1359385000.0 | 1371817000.0 |
| merch_lat | 1296675.0 | 38.53734 | 5.109788 | 19.02779 | 34.73357 | 39.36568 | 41.95716 | 67.51027 |
| merch_long | 1296675.0 | -90.22646 | 13.77109 | -166.6712 | -96.89728 | -87.43839 | -80.2368 | -66.9509 |


## Step 2 - Exploratory Data Analysis (EDA)

In this step, you need to perform **univariate** and **bivariate** analyses of the data, followed by feature transformations, if necessary. EDA helps in understanding the underlying structure and distribution of the data, as well as identifying potential issues like skewness or missing values that can impact model performance.

## Steps in EDA:

### 1. Univariate Analysis
Univariate analysis involves examining the distribution and summary statistics of each feature independently. This helps to understand the characteristics of individual features.

- **Numerical Features**:
  - Check the **mean**, **median**, **standard deviation**, and **range** of continuous variables such as `amt`, `city_pop`, `lat`, `long`, etc.
  - Plot histograms or box plots to visualize the distribution.
  
- **Categorical Features**:
  - Analyze the frequency distribution of categorical variables like `gender`, `category`, `job`, etc.
  - Use bar charts or pie charts to visualize the count of each category.
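
The univariate checks above can be sketched as follows. This is a minimal illustration on a made-up `toy` frame whose column names mirror the dataset; the values are illustrative and the variable names are assumptions.

```python
import pandas as pd

# Toy stand-in for the transactions table (values are illustrative only).
toy = pd.DataFrame({
    "amt": [4.97, 107.23, 220.11, 45.00, 41.96, 9.65, 83.14, 1500.00],
    "gender": ["F", "F", "M", "M", "M", "F", "M", "F"],
    "category": ["misc_net", "grocery_pos", "entertainment", "gas_transport",
                 "misc_pos", "grocery_pos", "grocery_pos", "misc_net"],
})

# Numerical feature: central tendency, spread, and range.
numeric_summary = toy["amt"].agg(["mean", "median", "std", "min", "max"])
amt_range = numeric_summary["max"] - numeric_summary["min"]

# Categorical features: absolute and relative frequencies.
gender_counts = toy["gender"].value_counts()
category_share = toy["category"].value_counts(normalize=True)

print(numeric_summary)
print(gender_counts)
print(category_share)
```

Calling `toy["amt"].plot(kind="hist")` or `gender_counts.plot(kind="bar")` would turn the same summaries into the histogram and bar chart mentioned above.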

### 2. Bivariate Analysis
Bivariate analysis involves analyzing the relationships between two variables. This helps to identify correlations, trends, and patterns that could be useful for predictive modeling.

- **Correlation**:
  - Use **correlation matrices** to check how numerical variables are correlated with each other.
  - Visualize using heatmaps to easily identify strong correlations.

- **Fraud vs. Legitimate Transactions**:
  - Analyze how features like `amt`, `city_pop`, `gender`, and `category` vary between fraudulent and legitimate transactions.
  - Visualize the differences using box plots, bar plots, or violin plots, and analyze the distribution for each class of the target variable (`is_fraud`).
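
A minimal sketch of these bivariate checks, again on a made-up `toy` frame (values and variable names are assumptions for illustration):

```python
import pandas as pd

# Toy stand-in with a numeric feature, a categorical feature, and the target.
toy = pd.DataFrame({
    "amt": [4.97, 107.23, 220.11, 45.00, 41.96, 980.50, 1200.00, 15.20],
    "city_pop": [3495, 149, 4154, 1939, 99, 3495, 149, 4154],
    "category": ["misc_net", "grocery_pos", "entertainment", "gas_transport",
                 "misc_pos", "shopping_net", "shopping_net", "grocery_pos"],
    "is_fraud": [0, 0, 0, 0, 0, 1, 1, 0],
})

# Pairwise Pearson correlations between numeric columns.
corr = toy[["amt", "city_pop", "is_fraud"]].corr()

# Transaction amount compared across the two target classes.
amt_by_class = toy.groupby("is_fraud")["amt"].agg(["count", "mean", "median"])

# Share of fraudulent transactions within each merchant category.
fraud_rate_by_category = toy.groupby("category")["is_fraud"].mean()

print(corr)
print(amt_by_class)
print(fraud_rate_by_category)
```

The `corr` frame is what a heatmap such as `sns.heatmap(corr, annot=True)` would visualize.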

### 3. Check for Skewness
### Skewness: What Is It?

**Skewness** is a statistical measure that describes the asymmetry or lopsidedness of the distribution of data. When the data is not symmetrically distributed, it is said to be skewed. Understanding skewness is important because it can affect the performance of certain machine learning models, particularly if the data is heavily skewed and is not properly transformed.

### Types of Skewness:

### 1. Positive Skew (Right Skew):
- The **right tail** (larger values) of the data is longer than the left tail.
- The **mean** is greater than the **median**.
- Most of the data points are clustered towards the **lower values**, with a few high values pulling the mean to the right.
- **Example**: **Income distribution** — Most people earn a lower or average income, with a few people earning very high incomes, causing a long right tail.

### 2. Negative Skew (Left Skew):
- The **left tail** (smaller values) of the data is longer than the right tail.
- The **mean** is less than the **median**.
- Most of the data points are clustered towards the **higher values**, with a few low values pulling the mean to the left.
- **Example**: **Age at retirement** — Most people retire around a certain age, but there are a few cases of very early retirements that create a left tail.

### 3. No Skew (Symmetry):
- A perfectly **symmetric distribution** has no skew. For a symmetric, unimodal distribution, the **mean**, **median**, and **mode** are all equal.
- **Example**: **Height distribution** in a balanced population might be roughly symmetric.

### Why Is Skewness Important?

### 1. Impact on Modeling:
Some models are sensitive to heavily skewed features. **Linear Regression** assumes normally distributed residuals for its inference to hold, and linear models in general, including **Logistic Regression**, can be dominated by a few extreme values in a skewed feature. Tree-based models are largely unaffected because they depend only on the ordering of feature values.

### 2. Skewed Data and Predictions:
- **Skewed features**, especially those with a strong **positive skew**, can let a handful of extreme values dominate the fit, leading to overestimated or underestimated predictions.
- Skewness is distinct from **class imbalance**. In **fraud detection**, the transaction amount (`amt`) is right-skewed, while the bias toward predicting **non-fraudulent transactions** comes from those transactions dominating the target (`is_fraud`), not from the skew of a feature.

### Summary:
- **Skewness** measures the **asymmetry** of a distribution.
- **Positive skew** means a long tail on the right, and **negative skew** means a long tail on the left.
- **Highly skewed data** may require **transformations** to improve model performance.

Because skewed data can bias model predictions, it's important to check the skewness of each continuous variable before modeling.

- Use the **skewness** statistic to quantify asymmetry and **kurtosis** to quantify how heavy the tails are.
- If features are highly skewed (e.g., transaction amounts), consider applying transformations like **log transformation**, **Box-Cox**, or **Yeo-Johnson** to make the distribution more symmetric. Log and Box-Cox transformations require strictly positive values, whereas Yeo-Johnson also handles zeros and negatives.
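
As a minimal sketch of the skewness check and a log transformation, the snippet below uses a small made-up sample of amounts (a few values borrowed from the summary table above). The names `amt` and `log_amt` are assumptions. `np.log1p` is used as the log transform because it computes `log(1 + x)` and therefore also tolerates zeros.

```python
import numpy as np
import pandas as pd

# Small right-skewed sample: most amounts are modest, one is very large.
amt = pd.Series([1.0, 4.97, 9.65, 41.96, 45.0, 47.52, 83.14, 107.23, 220.11, 28948.9])

raw_skew = amt.skew()   # sample skewness; > 0 indicates a long right tail
raw_kurt = amt.kurt()   # excess kurtosis; large values indicate heavy tails

# log1p compresses large values far more than small ones.
log_amt = np.log1p(amt)
log_skew = log_amt.skew()

print(f"mean={amt.mean():.2f}, median={amt.median():.2f}")
print(f"skew before={raw_skew:.2f}, after log1p={log_skew:.2f}")
```

For Box-Cox or Yeo-Johnson, scikit-learn's `PowerTransformer` offers both through its `method` parameter.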

### 4. Missing Data and Outliers
- **Missing Data**: Check for missing values in any columns and decide on how to handle them (e.g., imputation, removal).
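- **Outliers**: Detect outliers using box plots or Z-scores and decide whether to treat or remove them, depending on their impact. In fraud detection, unusually large amounts can themselves be a fraud signal, so inspect outliers against `is_fraud` before dropping them.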
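
Both checks can be sketched on a made-up `toy` frame (values and names are assumptions). The outlier part compares the 1.5 × IQR rule, which is what box-plot whiskers use, with a Z-score threshold of 3.

```python
import numpy as np
import pandas as pd

# Toy stand-in with one missing value per column and one extreme amount.
toy = pd.DataFrame({
    "amt": [4.97, 107.23, np.nan, 45.00, 41.96, 9.65, 83.14, 28948.90],
    "category": ["misc_net", None, "entertainment", "gas_transport",
                 "misc_pos", "grocery_pos", "grocery_pos", "misc_net"],
})

# Missing values per column.
missing_per_column = toy.isna().sum()

amt = toy["amt"].dropna()

# IQR rule: flag values beyond 1.5 * IQR from the quartiles.
q1, q3 = amt.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = amt[(amt < q1 - 1.5 * iqr) | (amt > q3 + 1.5 * iqr)]

# Z-score rule: flag values more than 3 standard deviations from the mean.
# With only 7 values the extreme amount inflates the standard deviation,
# so no |z| exceeds 3 here even though the value is clearly unusual.
z_scores = (amt - amt.mean()) / amt.std()
z_outliers = amt[z_scores.abs() > 3]

print(missing_per_column)
print(iqr_outliers)
print(z_outliers)
```

The contrast is the reason quantile-based rules are often preferred for heavily skewed features such as transaction amounts.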

### 5. Feature Engineering
- **Feature Transformation**: Apply transformations where necessary (e.g., scaling, encoding categorical variables, creating new features like transaction time-based features).
- **Feature Creation**: Create new features from existing ones, such as the **age** of the cardholder from `dob` or the **distance** between cardholder and merchant locations from `lat`, `long`, `merch_lat`, and `merch_long`.
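
The feature ideas above can be sketched as follows, using two rows copied from the sample output earlier in the notebook. The helper `haversine_km` and the new column names (`age`, `trans_hour`, `distance_km`) are assumptions for illustration, and the great-circle (haversine) formula is one possible choice of distance, not necessarily the one used later in the project.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "trans_date_trans_time": ["2019-01-01 00:00:18", "2019-01-01 00:00:44"],
    "dob": ["1988-03-09", "1978-06-21"],
    "lat": [36.0788, 48.8878],
    "long": [-81.1781, -118.2105],
    "merch_lat": [36.011293, 49.159047],
    "merch_long": [-82.048315, -118.186462],
})

trans_time = pd.to_datetime(toy["trans_date_trans_time"])
dob = pd.to_datetime(toy["dob"])

# Age in whole years at transaction time: subtract one year if the
# birthday has not yet occurred in the transaction year.
had_birthday = (trans_time.dt.month > dob.dt.month) | (
    (trans_time.dt.month == dob.dt.month) & (trans_time.dt.day >= dob.dt.day)
)
toy["age"] = trans_time.dt.year - dob.dt.year - (~had_birthday).astype(int)

# Simple time-based feature: hour of day of the transaction.
toy["trans_hour"] = trans_time.dt.hour


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    earth_radius_km = 6371.0
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * earth_radius_km * np.arcsin(np.sqrt(a))


toy["distance_km"] = haversine_km(
    toy["lat"], toy["long"], toy["merch_lat"], toy["merch_long"]
)

print(toy[["age", "trans_hour", "distance_km"]])
```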

### 6. Visualizations
Visualizations help to identify trends, patterns, and anomalies in the data, providing insights that may not be apparent from raw data alone.

- **Histograms and Density Plots** for distribution of numeric variables.
- **Bar Plots** for categorical data.
- **Box Plots** for comparing the distribution of features across different target classes (fraud vs. non-fraud).
- **Pair Plots** or **Scatter Plots** for detecting relationships between numeric features.

---

By performing these analyses, you'll better understand the data, identify potential issues, and prepare it for model building.


In [16]:
# duplicate values in dataset

df.duplicated().sum()


0

### No Duplicate Values in the Dataset

`df.duplicated().sum()` returns 0, so the dataset contains no fully duplicated rows and nothing needs to be removed.

If duplicate rows were present, the following code would drop them:

```python
df.drop_duplicates(inplace=True)
```

### Key Points:
- **drop_duplicates()** removes duplicate rows from the DataFrame.
- **inplace=True** modifies the original DataFrame directly instead of returning a new one (the call then returns `None`).
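
As a small illustration on made-up data, the snippet below counts duplicates and drops them without `inplace=True`, assigning the returned DataFrame instead. It also shows the `subset` parameter, which restricts the comparison to chosen columns; using `trans_num` as that key is an assumption about how one might define a duplicate transaction.

```python
import pandas as pd

# Toy frame in which the second and third rows are identical.
toy = pd.DataFrame({
    "trans_num": ["a1", "b2", "b2", "c3"],
    "amt": [4.97, 107.23, 107.23, 45.00],
})

# Number of rows that repeat an earlier row in every column.
n_dupes = int(toy.duplicated().sum())

# Non-inplace form: returns a new DataFrame and leaves `toy` untouched.
deduped = toy.drop_duplicates().reset_index(drop=True)

# Duplicates judged on the transaction identifier alone.
n_dupe_ids = int(toy.duplicated(subset=["trans_num"]).sum())

print(n_dupes, len(deduped), n_dupe_ids)
```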
