# Fraud Detection

## Dataset exploration

### [Dataset](https://www.kaggle.com/datasets/chitwanmanchanda/fraudulent-transactions-data) Description: Fraudulent Transactions Prediction

### Context  
The dataset supports the development of a model that predicts fraudulent transactions for a financial company. Insights from the model are meant to inform an actionable prevention plan. The data is a single CSV file with **6,362,620 rows** and **11 columns**, simulating 30 days of transactions.

### Content  
The dataset includes the following features:

- **`step`**: Time unit in hours (1 step = 1 hour, total steps = 744).
- **`type`**: Type of transaction, one of `CASH_IN`, `CASH_OUT`, `DEBIT`, `PAYMENT`, `TRANSFER`.
- **`amount`**: Transaction amount in local currency.
- **`nameOrig`**: Customer initiating the transaction.
- **`oldbalanceOrg`**: Initial balance of the sender before the transaction.
- **`newbalanceOrig`**: New balance of the sender after the transaction.
- **`nameDest`**: Recipient of the transaction.
- **`oldbalanceDest`**: Balance of the recipient before the transaction. No balance information is available for merchants (names starting with "M"); the sample rows in this notebook show `0.0` for them, and the null check finds no missing values.
- **`newbalanceDest`**: Balance of the recipient after the transaction (likewise not available for merchants).
- **`isFraud`**: Indicates if the transaction was fraudulent (simulated fraudulent behavior involves taking control of accounts to transfer and withdraw funds).
- **`isFlaggedFraud`**: Flags illegal attempts, defined as transferring amounts exceeding 200,000 in a single transaction.
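
Two of these definitions translate directly into derived columns. The sketch below is illustrative only: the rows are hypothetical, `day`, `hour_of_day` and `rule_flag` are names introduced here, and it assumes that `step` 1 is the first hour of day 1. `rule_flag` re-applies the rule as worded above and is not the dataset's own `isFlaggedFraud` column.

```python
import pandas as pd

# Hypothetical toy rows; only the column names come from the dataset.
toy = pd.DataFrame({
    "step": [1, 24, 25, 48],
    "type": ["PAYMENT", "TRANSFER", "TRANSFER", "PAYMENT"],
    "amount": [9839.64, 181.0, 250000.0, 1864.28],
})

# Assumption: step 1 is hour 0 of day 1, so steps 1..24 fall on day 1.
toy["day"] = (toy["step"] - 1) // 24 + 1
toy["hour_of_day"] = (toy["step"] - 1) % 24

# The flagging rule as worded above: a single transfer above 200,000.
toy["rule_flag"] = (toy["type"] == "TRANSFER") & (toy["amount"] > 200_000)

print(toy)
```

Comparing a column like `rule_flag` with the provided `isFlaggedFraud` on the full data would show how closely the stated rule matches the flag in practice.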

### Inspiration  
The dataset supports the following analysis tasks:

- Data cleaning to address missing values, outliers, and multi-collinearity.
- Elaboration on the fraud detection model and variable selection process.
- Evaluation of model performance using appropriate tools.
- Identification of key predictors for fraudulent transactions.
- Logical analysis of the predictive factors.
- Recommendations for prevention strategies to enhance infrastructure security.
- Measurement of the effectiveness of implemented actions.


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
data = pd.read_csv('raw_data/Fraud.csv')

print(f"Shape: {data.shape}")
data.head(3)

Shape: (6362620, 11)


| | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | PAYMENT | 9839.64 | C1231006815 | 170136.0 | 160296.36 | M1979787155 | 0.0 | 0.0 | 0 | 0 |
| 1 | 1 | PAYMENT | 1864.28 | C1666544295 | 21249.0 | 19384.72 | M2044282225 | 0.0 | 0.0 | 0 | 0 |
| 2 | 1 | TRANSFER | 181.0 | C1305486145 | 181.0 | 0.0 | C553264065 | 0.0 | 0.0 | 1 | 0 |
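
In these three rows the sender's new balance equals the old balance minus the amount. Whether that holds across the whole file is worth checking before treating the balance columns as reliable features. A minimal sketch, with the three sample rows retyped by hand and `orig_balance_consistent` as an illustrative column name:

```python
import numpy as np
import pandas as pd

# The three rows shown by data.head(3), retyped by hand.
sample = pd.DataFrame({
    "amount": [9839.64, 1864.28, 181.0],
    "oldbalanceOrg": [170136.0, 21249.0, 181.0],
    "newbalanceOrig": [160296.36, 19384.72, 0.0],
})

# Assumption: an outgoing transaction lowers the sender's balance by `amount`.
expected_new = sample["oldbalanceOrg"] - sample["amount"]

# Compare with a tolerance, since the amounts are floating-point values.
sample["orig_balance_consistent"] = np.isclose(expected_new, sample["newbalanceOrig"])
print(sample)
```

On the full data, the same comparison would need to account for transaction types that add to the sender's balance instead of reducing it.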


In [3]:
# Checking for null values
data.isna().sum()

step              0
type              0
amount            0
nameOrig          0
oldbalanceOrg     0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64
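
A zero count from `isna()` does not rule out placeholder values. In the sample rows above, both merchant recipients carry `0.0` in the destination balance columns, so it may be worth counting such rows separately. The sketch below rebuilds those three rows by hand; `dest_is_merchant` and `dest_balance_is_zero` are names introduced here for illustration.

```python
import pandas as pd

# Destination columns of the three rows shown by data.head(3), retyped by hand.
sample = pd.DataFrame({
    "nameDest": ["M1979787155", "M2044282225", "C553264065"],
    "oldbalanceDest": [0.0, 0.0, 0.0],
    "newbalanceDest": [0.0, 0.0, 0.0],
})

# Illustrative helper columns (not part of the dataset).
sample["dest_is_merchant"] = sample["nameDest"].str.startswith("M")
sample["dest_balance_is_zero"] = (
    (sample["oldbalanceDest"] == 0) & (sample["newbalanceDest"] == 0)
)

# How many recipients are merchants, and how many of those have zero balances?
n_merchant = int(sample["dest_is_merchant"].sum())
n_merchant_zero = int((sample["dest_is_merchant"] & sample["dest_balance_is_zero"]).sum())
print(f"Merchant recipients: {n_merchant}, of which zero-balance: {n_merchant_zero}")
```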

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6362620 entries, 0 to 6362619
Data columns (total 11 columns):
 #   Column          Dtype  
---  ------          -----  
 0   step            int64  
 1   type            object 
 2   amount          float64
 3   nameOrig        object 
 4   oldbalanceOrg   float64
 5   newbalanceOrig  float64
 6   nameDest        object 
 7   oldbalanceDest  float64
 8   newbalanceDest  float64
 9   isFraud         int64  
 10  isFlaggedFraud  int64  
dtypes: float64(5), int64(3), object(3)
memory usage: 534.0+ MB
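
At more than 500 MB, the frame may be worth shrinking before modelling. One common approach is to store low-cardinality strings as categoricals and to downcast integer columns. The sketch below shows the mechanics on a hypothetical toy frame; the byte counts it prints say nothing about the savings on the real data.

```python
import pandas as pd

# Hypothetical toy frame reusing three of the dataset's column names.
toy = pd.DataFrame({
    "step": [1, 1, 2, 2],
    "type": ["PAYMENT", "TRANSFER", "PAYMENT", "TRANSFER"],
    "isFraud": [0, 1, 0, 0],
})
before = toy.memory_usage(deep=True).sum()

compact = toy.copy()
# Few distinct values: store each string once and keep integer codes per row.
compact["type"] = compact["type"].astype("category")
# Downcast integer columns to the smallest integer dtype that fits their values.
for col in ["step", "isFraud"]:
    compact[col] = pd.to_numeric(compact[col], downcast="integer")

after = compact.memory_usage(deep=True).sum()
print(compact.dtypes)
print(f"bytes before: {before}, after: {after}")
```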


In [None]:
# Check for class imbalance: count the rows in each class
n_total = len(data)
non_fraudulent = int((data['isFraud'] == 0).sum())
fraudulent = int((data['isFraud'] == 1).sum())

# Share of each class as a fraction of all rows (not a percentage)
non_fraudulent_perc = np.round(non_fraudulent / n_total, decimals=4)
fraudulent_perc = np.round(fraudulent / n_total, decimals=4)

print(f"Non-fraudulent transactions: {non_fraudulent}\nFraudulent transactions: {fraudulent}")

print("_"*100)

print(f"Non-fraudulent transactions: {non_fraudulent_perc}\nFraudulent transactions: {fraudulent_perc}")

Non-fraudulent transactions: 6354407
Fraudulent transactions: 8213
____________________________________________________________________________________________________
Non-fraudulent transactions: 0.9987
Fraudulent transactions: 0.0013
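
With this level of imbalance, accuracy alone says little about a model. The arithmetic below uses only the two counts printed above and considers a hypothetical baseline that labels every transaction as non-fraudulent.

```python
# Class counts copied from the output above.
non_fraudulent = 6_354_407
fraudulent = 8_213
total = non_fraudulent + fraudulent

# Majority-to-minority ratio.
imbalance_ratio = non_fraudulent / fraudulent

# Hypothetical baseline that labels every transaction as non-fraudulent:
# it is right on every legitimate row and wrong on every fraudulent one.
baseline_accuracy = non_fraudulent / total
baseline_recall = 0 / fraudulent  # no fraud case is ever caught

print(f"Imbalance ratio: about {imbalance_ratio:.0f} to 1")
print(f"Baseline accuracy: {baseline_accuracy:.4f}")
print(f"Baseline fraud recall: {baseline_recall:.4f}")
```

The baseline's accuracy equals the non-fraudulent share, while its recall on fraud is zero. Metrics that focus on the fraud class, such as precision and recall, are therefore likely to be more informative here.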


In [None]:
# Check that every transaction with isFlaggedFraud == 1 is also labelled isFraud == 1
assert (data.loc[data['isFlaggedFraud'] == 1, 'isFraud'] == 1).all()
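
The assertion passes silently, so it does not show how many rows fall into each combination of the two labels. A cross-tabulation makes that visible. The sketch below runs on hypothetical toy labels, not on the dataset:

```python
import pandas as pd

# Hypothetical labels, chosen so that every flagged row is also fraudulent.
toy = pd.DataFrame({
    "isFraud":        [0, 0, 1, 1, 1, 0],
    "isFlaggedFraud": [0, 0, 0, 1, 0, 0],
})

# Rows: isFraud, columns: isFlaggedFraud, cells: number of transactions.
counts = pd.crosstab(toy["isFraud"], toy["isFlaggedFraud"])
print(counts)

# The (isFraud=0, isFlaggedFraud=1) cell must be zero for the assertion to hold.
flagged_but_not_fraud = int(counts.loc[0, 1])
# The (isFraud=1, isFlaggedFraud=0) cell counts fraud the flag missed.
fraud_not_flagged = int(counts.loc[1, 0])
```

Applied to `data`, the same call would also show how much of the fraud the flag misses.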