# ---
# jupyter:
#   jupytext:
#     text_representation:
#       extension: .py
#       format_name: light
#       format_version: '1.5'
#       jupytext_version: 1.14.5
#   kernelspec:
#     display_name: Python 3 (ipykernel)
#     language: python
#     name: python3
# ---

# # Exploratory Data Analysis: Fraud Detection Dataset
# 
# **Objective:** To explore the `historical_transactions.csv` dataset, understand its structure, identify patterns related to fraudulent transactions, and gather insights to inform our machine learning model.

# ## 1. Setup and Data Loading
# 
# First, let's import the necessary libraries and load our dataset into a Pandas DataFrame.

import os

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Set a style for our plots
sns.set_style("whitegrid")

# Load the dataset
data_path = os.path.join('..', 'data', 'historical_transactions.csv')
df = pd.read_csv(data_path)

# Display the first few rows to get a feel for the data
df.head()

# ## 2. Initial Data Inspection
# 
# Let's get a high-level overview of the dataset's structure, data types, and summary statistics.

# Check the shape of the DataFrame (rows, columns)
print(f"Dataset shape: {df.shape}\n")

# Get concise summary of the DataFrame
print("Data Info:")
df.info()

# It looks like the `timestamp` column is an object (string). We should convert it to a datetime object for time-based analysis.
df['timestamp'] = pd.to_datetime(df['timestamp'])

# Generate descriptive statistics for numerical columns
print("\nDescriptive Statistics:")
df.describe()

# ### Initial Observations:
# - The dataset contains 500 transactions.
# - The `amount` column has a very large standard deviation and a maximum value much higher than the 75th percentile, suggesting the presence of outliers (which could be our fraudulent transactions).
# - `is_international` and `is_fraud` are binary, as expected.
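
# The outlier observation above can be quantified with Tukey's IQR rule. The cell below is a minimal sketch, not part of the original analysis: `iqr_outlier_mask` is a hypothetical helper, and the multiplier `k=1.5` is the conventional default rather than a value tuned for this dataset.

import pandas as pd

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean Series marking values above Q3 + k * IQR."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    upper_fence = q3 + k * (q3 - q1)
    return values > upper_fence

# Example usage (assumes `df` from the loading cell above):
# high_amounts = iqr_outlier_mask(df['amount'])
# print(f"{high_amounts.sum()} of {len(df)} amounts lie above the upper fence")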

# ## 3. Analyzing the Target Variable (`is_fraud`)
# 
# Understanding the distribution of our target variable is crucial, especially in fraud detection where datasets are often highly imbalanced.

# Calculate the distribution of fraudulent vs. non-fraudulent transactions
fraud_distribution = df['is_fraud'].value_counts(normalize=True) * 100
print("Fraud Distribution (%):")
print(fraud_distribution)

# Visualize the distribution
plt.figure(figsize=(8, 6))
sns.countplot(x='is_fraud', data=df, palette=['#3498db', '#e74c3c'])
plt.title('Distribution of Fraudulent vs. Legitimate Transactions')
plt.xlabel('Is Fraud? (0: No, 1: Yes)')
plt.ylabel('Number of Transactions')
plt.show()

# ### Observation:
# - The dataset is **highly imbalanced**: fraudulent transactions make up only a small percentage of the total. This is a critical insight for model building, because a model can score high accuracy while missing most fraud, so we need techniques suited to rare events (like the Isolation Forest model we chose, which treats fraud as an anomaly rather than as a second class to learn).
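
# To turn the imbalance into a single number, the sketch below computes the observed fraud rate. `fraud_rate` is a hypothetical helper, not part of the original analysis. If the model is scikit-learn's `IsolationForest` (an assumption, since this notebook does not say which implementation is used), this rate is one plausible starting point for its `contamination` parameter.

import pandas as pd

def fraud_rate(labels):
    """Fraction of rows labelled 1 in a binary 0/1 Series."""
    return float(labels.mean())

# Example usage (assumes `df` from the loading cell above):
# rate = fraud_rate(df['is_fraud'])
# print(f"Fraud rate: {rate:.2%}")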

# ## 4. Exploring Key Features vs. Fraud
# 
# Let's see how different features relate to fraudulent activity.

# ### Transaction Amount
# 
# How does the transaction amount differ for fraudulent and legitimate transactions?

plt.figure(figsize=(12, 7))
sns.boxplot(x='is_fraud', y='amount', data=df, palette=['#3498db', '#e74c3c'])
plt.title('Transaction Amount vs. Fraud Status')
plt.xlabel('Is Fraud? (0: No, 1: Yes)')
plt.ylabel('Transaction Amount')
plt.yscale('log') # Use a log scale to better visualize the wide range of amounts
plt.show()

# ### Observation:
# - Fraudulent transactions have a **markedly higher median amount** and a wider spread, consistent with our initial suspicion from the summary statistics. High-value transactions are a strong indicator of potential fraud.
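
# The boxplot comparison can be backed with numbers. The cell below is a minimal sketch: `amount_summary_by_class` is a hypothetical helper, and its default column names are the ones used in this notebook.

import pandas as pd

def amount_summary_by_class(frame, amount_col='amount', label_col='is_fraud'):
    """Count, median, mean, and max of the amount column for each label value."""
    return frame.groupby(label_col)[amount_col].agg(['count', 'median', 'mean', 'max'])

# Example usage (assumes `df` from the loading cell above):
# print(amount_summary_by_class(df))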

# ### International Transactions
# 
# Are international transactions more likely to be fraudulent?

# Create a cross-tabulation to see the relationship
international_fraud_ct = pd.crosstab(df['is_international'], df['is_fraud'])
print(international_fraud_ct)

# Visualize it
international_fraud_ct.plot(kind='bar', stacked=True, figsize=(10, 7), color=['#3498db', '#e74c3c'])
plt.title('Fraud Status by International vs. Domestic Transactions')
plt.xlabel('Is International? (0: No, 1: Yes)')
plt.ylabel('Number of Transactions')
plt.xticks(rotation=0)
plt.legend(title='Is Fraud?', labels=['Legitimate', 'Fraud'])
plt.show()

# ### Observation:
# - A larger share of international transactions is fraudulent than of domestic ones, which makes this feature another strong predictor.
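
# Raw counts can hide the comparison when the two groups differ in size, so it helps to look at rates within each group. The sketch below row-normalises the cross-tabulation with `normalize='index'`; `fraud_rate_by_group` is a hypothetical helper, not part of the original analysis.

import pandas as pd

def fraud_rate_by_group(frame, group_col, label_col='is_fraud'):
    """Row-normalised crosstab: share of each label value within every group."""
    return pd.crosstab(frame[group_col], frame[label_col], normalize='index')

# Example usage (assumes `df` from the loading cell above):
# print(fraud_rate_by_group(df, 'is_international'))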

# ### Merchant Category
# 
# Are certain merchant categories more prone to fraud?

plt.figure(figsize=(14, 8))
sns.countplot(y='merchant_category', hue='is_fraud', data=df, order=df['merchant_category'].value_counts().index, palette=['#3498db', '#e74c3c'])
plt.title('Fraud Distribution Across Merchant Categories')
plt.xlabel('Number of Transactions')
plt.ylabel('Merchant Category')
plt.legend(title='Is Fraud?', labels=['Legitimate', 'Fraud'])
plt.show()

# ### Observation:
# - Fraudulent transactions appear most frequently in `luxury`, `electronics`, and `travel` categories. These are often high-ticket items, which makes sense for fraudulent activity.
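
# A category with many transactions can show many frauds while still having a low fraud rate, so a per-category table is a useful cross-check on the count plot. The cell below is a minimal sketch: `category_fraud_table` is a hypothetical helper, not part of the original analysis.

import pandas as pd

def category_fraud_table(frame, category_col='merchant_category', label_col='is_fraud'):
    """Transactions, fraud count, and fraud rate per category, highest rate first."""
    table = frame.groupby(category_col)[label_col].agg(
        transactions='count', frauds='sum', fraud_rate='mean'
    )
    return table.sort_values('fraud_rate', ascending=False)

# Example usage (assumes `df` from the loading cell above):
# print(category_fraud_table(df))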

# ## 5. Summary of Findings
# 
# Based on this exploratory analysis, we've identified several key patterns that will be valuable for our fraud detection model:
# 
# 1.  **Imbalanced Data:** The dataset is highly imbalanced, which motivates using an anomaly detection algorithm like Isolation Forest rather than a standard classification model.
# 2.  **High Transaction Amounts:** Fraudulent transactions are strongly correlated with unusually high transaction amounts.
# 3.  **International Transactions:** International transactions are more likely to be fraudulent than domestic ones.
# 4.  **High-Risk Categories:** The `luxury`, `electronics`, and `travel` merchant categories are hotspots for fraudulent activity.
# 
# These insights confirm that the features `amount` and `is_international` are excellent choices for our initial model. For a more advanced model, we could also incorporate `merchant_category` by encoding it into a numerical format.
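
# One common way to encode `merchant_category` numerically is one-hot encoding. The cell below is a minimal sketch under that assumption: `encode_merchant_category` is a hypothetical helper, the `cat` prefix is an arbitrary choice, and other encodings may suit the final model better.

import pandas as pd

def encode_merchant_category(frame, column='merchant_category'):
    """Return a copy of `frame` with `column` replaced by 0/1 indicator columns."""
    return pd.get_dummies(frame, columns=[column], prefix='cat', dtype=int)

# Example usage (assumes `df` from the loading cell above):
# features = encode_merchant_category(df[['amount', 'is_international', 'merchant_category']])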


Generating initial training data...
Training fraud detection model...
Model training complete.

--- Starting Real-time Fraud Detection ---

Received new batch of transactions:
   amount  is_international
0   84.25                 0
1  196.90                 1
2   88.17                 1
3  115.30                 1
4   83.04                 0
No fraudulent activity detected in this batch.

Received new batch of transactions:
   amount  is_international
0   38.89                 1
1  175.04                 1
2  150.92                 1
3   90.04                 1
4   53.09                 0
No fraudulent activity detected in this batch.

Received new batch of transactions:
   amount  is_international
0  132.06                 0
1  102.56                 0
2  176.55                 1
3  163.59                 0
4   10.47                 0
No fraudulent activity detected in this batch.

Received new batch of transactions:
   amount  is_international
0  160.90                 0
1  173.55   