# Fraud Detection – Data Understanding (EDA v1)

**Dataset:** IEEE-CIS Fraud Detection  
**Author:** Bruno Martins  

---

## 0. Title & Context

This notebook performs structured data understanding for the IEEE-CIS fraud dataset.

**Goal:** Evaluate whether this dataset supports building a production-like antifraud system for card-not-present (CNP) e-commerce transactions.

**Scope:** Analyze dataset structure, data quality, fraud distribution, leakage risks, and implications for modeling and system architecture.

**Outcome:** Provide an informed assessment of whether the IEEE-CIS dataset is suitable for developing an end-to-end antifraud pipeline.

## 1. Executive Summary

The IEEE-CIS dataset represents card-not-present (CNP) e-commerce transactions enriched with device, payment, and anonymized behavioral metadata.

**Key findings**

• Classes are highly imbalanced (fraud rate roughly 1–3%; about 1.5% in the 1,000-row sample), requiring cost-sensitive modeling.  
• Identity information is missing for about 71% of sampled transactions.  
• Temporal structure is present and must be respected in validation.  
• Some engineered anonymized features may introduce leakage risk.

**Conclusion**

The dataset is suitable for developing a realistic fraud detection pipeline, provided that class imbalance, temporal validation, identity coverage, and potential leakage are handled carefully.

In [35]:
from pathlib import Path
import pandas as pd

DATA_PATH = Path("../data/raw")

files = list(DATA_PATH.glob("*"))
files

[WindowsPath('../data/raw/sample_submission.csv'),
 WindowsPath('../data/raw/test_identity.csv'),
 WindowsPath('../data/raw/test_transaction.csv'),
 WindowsPath('../data/raw/train_identity.csv'),
 WindowsPath('../data/raw/train_transaction.csv')]

## 2. Dataset Context

The IEEE-CIS dataset is composed of two main tables:

• **transaction** table  
• **identity** table  

Both tables are joined by `TransactionID`.

The data originates from real-world card-not-present (CNP) e-commerce transactions provided by Vesta Corporation.

**Known limitations**

`TransactionDT` represents a time delta from a reference point rather than a real calendar timestamp. This must be considered when performing temporal analysis to avoid incorrect interpretations.
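A minimal sketch of how the time delta could be turned into relative day and hour indices. It assumes `TransactionDT` is measured in seconds, which the text above does not confirm; the offsets and the column names `day_index` and `hour_of_day` are illustrative only.

```python
import pandas as pd

# Illustrative offsets. ASSUMPTION: TransactionDT is measured in seconds.
df = pd.DataFrame({"TransactionDT": [86400, 112151, 200000]})

SECONDS_PER_DAY = 24 * 60 * 60

# Whole days elapsed since the (unknown) reference point.
df["day_index"] = df["TransactionDT"] // SECONDS_PER_DAY
# Hour within the relative day; this is not a wall-clock hour.
df["hour_of_day"] = (df["TransactionDT"] % SECONDS_PER_DAY) // 3600

print(df)
```

Because the reference point is unknown, these indices only support relative comparisons (ordering, elapsed time, periodicity), not calendar effects such as holidays.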

In [None]:
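# Load a small sample for faster exploratory analysis.
# Note: nrows reads the first 1,000 rows of each file, not a random sample,
# so sample statistics are rough estimates only.
# Remove nrows to load the full dataset in later stages.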

train_tr = pd.read_csv(DATA_PATH / "train_transaction.csv", nrows=1000)
train_id = pd.read_csv(DATA_PATH / "train_identity.csv", nrows=1000)

train_tr.shape, train_id.shape

((1000, 394), (1000, 41))

## 3. Data Model

We evaluate the relational structure of the dataset to understand how the transaction and identity tables interact.

Specifically, we analyze:

• Join coverage – proportion of transactions with identity information  
• Join cardinality – whether the join introduces duplicate rows  
• Missing identity records – impact on downstream modeling  

This step helps determine whether identity features can be reliably used in a production fraud detection system.

In [44]:
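# Left-join identity onto transactions to evaluate join coverage and duplicates
train_full = train_tr.merge(train_id, on="TransactionID", how="left")

# DeviceType is used as a proxy for "has an identity row"; identity rows with a
# missing DeviceType would be counted as uncovered.
identity_coverage = train_full["DeviceType"].notnull().mean()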
n_rows_after = len(train_full)
dup_after_join = train_full["TransactionID"].duplicated().any()

identity_coverage, n_rows_after, dup_after_join

(np.float64(0.289), 1000, np.False_)

**Conclusion:** Identity information is available for only ~29% of transactions (sample estimate, using `DeviceType` as a proxy for a matched identity row). Each transaction matches at most one identity row, so the left join introduces no duplicate rows. Models that rely heavily on identity features may perform poorly on the majority of transactions; the system must therefore handle missing identity data robustly.
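Coverage can also be measured directly from the join rather than through a proxy column. The sketch below uses toy tables with hypothetical values and relies on the `validate` and `indicator` options of `DataFrame.merge`; it is one possible check, not the method used in the cell above.

```python
import pandas as pd

# Toy stand-ins for the two tables (hypothetical values, not taken from the dataset).
transactions = pd.DataFrame({"TransactionID": [1, 2, 3, 4], "isFraud": [0, 0, 1, 0]})
identity = pd.DataFrame({"TransactionID": [2, 3], "DeviceType": ["mobile", None]})

merged = transactions.merge(
    identity,
    on="TransactionID",
    how="left",
    validate="one_to_one",  # raises MergeError if either side repeats a TransactionID
    indicator=True,         # adds a "_merge" column: "both" or "left_only" for a left join
)

# Share of transactions with a matching identity row.
coverage = (merged["_merge"] == "both").mean()
# Proxy based on a single column; undercounts when DeviceType itself is missing.
proxy_coverage = merged["DeviceType"].notna().mean()

print(coverage, proxy_coverage)
```

In the toy data the two measures differ (0.5 vs 0.25) because one identity row has no `DeviceType`, which is the case the proxy cannot see.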

In [None]:
numeric_cols = train_tr.select_dtypes(include="number").columns
categorical_cols = train_tr.select_dtypes(include="object").columns

print(f"Numeric columns: {len(numeric_cols)}")
print(f"Categorical columns: {len(categorical_cols)}")

Numeric columns: 380
Categorical columns: 14


## 4. Feature Taxonomy

Features can be grouped into the following categories:

• Payment metadata (card1–card6, ProductCD)  
• Device information (DeviceType, DeviceInfo) – evidence of device fingerprinting  
• Identity features (id_*)  
• Temporal features (TransactionDT)  
• Engineered anonymized features (V*, D*) – underlying semantics not publicly available  

**Conclusion:** Many categorical variables have high cardinality, requiring careful encoding strategies. The anonymized V* and D* features may limit interpretability and should be evaluated for potential leakage and stability.
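One encoding option for high-cardinality columns is frequency encoding with rare levels collapsed. The sketch below is a hypothetical helper (`frequency_encode` and `min_count` are illustrative names) that learns frequencies on training data only and maps rare, unseen, or missing levels to 0.0.

```python
import pandas as pd

def frequency_encode(train_col: pd.Series, other_col: pd.Series, min_count: int = 2) -> pd.Series:
    """Map each category to its training-set relative frequency.

    Levels seen fewer than `min_count` times in training, unseen levels,
    and missing values are all encoded as 0.0.
    """
    counts = train_col.value_counts()  # NaN is excluded by default
    freqs = counts[counts >= min_count] / len(train_col)
    return other_col.map(freqs).fillna(0.0)

# Toy categorical values (hypothetical).
train_col = pd.Series(["a", "a", "b", "b", "b", "c"])
new_col = pd.Series(["a", "c", "z", None])

encoded = frequency_encode(train_col, new_col)
print(encoded.round(3).tolist())
```

Fitting the mapping on the training window only keeps the encoding usable at scoring time, when new categories may appear.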

## 5. Data Quality

We assess data quality to identify issues that may impact modeling and deployment.

Checks include:

• Missing values  
• Duplicate records  
• Outliers in numerical features  
• Consistency between train and test datasets
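The train/test consistency check is not run in the cells below. A minimal sketch of what it could look like, assuming only that both tables are loaded as DataFrames; `compare_schemas` is a hypothetical helper and the `card_4` column is an invented mismatch for illustration.

```python
import pandas as pd

def compare_schemas(train: pd.DataFrame, test: pd.DataFrame, target: str = "isFraud") -> dict:
    """Report columns present on only one side and shared columns whose dtypes differ."""
    train_cols = set(train.columns) - {target}
    test_cols = set(test.columns)
    shared = sorted(train_cols & test_cols)
    return {
        "only_in_train": sorted(train_cols - test_cols),
        "only_in_test": sorted(test_cols - train_cols),
        "dtype_mismatch": [c for c in shared if train[c].dtype != test[c].dtype],
    }

# Toy frames with a deliberate naming and dtype mismatch (hypothetical values).
train = pd.DataFrame({"TransactionID": [1], "TransactionAmt": [10.0], "card4": ["visa"], "isFraud": [0]})
test = pd.DataFrame({"TransactionID": [2], "TransactionAmt": [12], "card_4": ["visa"]})

report = compare_schemas(train, test)
print(report)
```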

In [47]:
# Estimate missing value rates using a sample of the transaction table
missing_rate = train_tr.isnull().mean().sort_values(ascending=False)
missing_rate.head(20)

D7       0.934
dist2    0.908
D13      0.863
D9       0.861
D8       0.861
D14      0.861
V338     0.847
V336     0.847
V339     0.847
V337     0.847
V335     0.847
V147     0.847
V327     0.847
V325     0.847
V326     0.847
V331     0.847
V330     0.847
V329     0.847
V332     0.847
V324     0.847
dtype: float64

In [48]:
# Check uniqueness of TransactionID and inspect TransactionAmt distribution
no_dup_tr = train_tr["TransactionID"].nunique() == len(train_tr)
amt = train_tr["TransactionAmt"]
amt_q99 = amt.quantile(0.99)

no_dup_tr, amt.min(), amt.max(), amt_q99

(True, np.float64(1.896), np.float64(3162.95), np.float64(2295.2601999999997))

**Conclusions**

• Many features have very high missing rates (the 20 sparsest columns are roughly 85–93% missing in the sample) → models must explicitly handle missingness (e.g., imputation, missing indicators, or sparse-aware models).  
• The identity table has partial coverage (see Data Model section).  
• Several categorical features contain rare categories → robust encoding strategies are required (e.g., target encoding, frequency encoding, or grouping rare levels).  
• TransactionAmt shows a long-tailed distribution → log transformation or clipping may improve model stability.  
• No duplicate TransactionIDs were detected in the transaction table (sample check).
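The missingness and long-tail points above can be sketched as simple preprocessing steps. The values are toy data, the `-1.0` sentinel is an assumption that only makes sense if the feature is never negative, and in practice the clipping cap would be computed on the training window only.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical values) with one sparse feature and one long-tailed amount.
df = pd.DataFrame({"TransactionAmt": [1.896, 50.0, 3162.95], "D7": [np.nan, 3.0, np.nan]})

# Flag missingness explicitly so a model can use "value absent" as a signal.
df["D7_missing"] = df["D7"].isna().astype("int8")
# Sentinel imputation (ASSUMPTION: -1.0 lies outside the feature's valid range).
df["D7_filled"] = df["D7"].fillna(-1.0)

# Compress the right tail; log1p is also defined for zero amounts.
df["amt_log1p"] = np.log1p(df["TransactionAmt"])
# Alternatively, cap extreme values at a chosen quantile.
cap = df["TransactionAmt"].quantile(0.99)
df["amt_clipped"] = df["TransactionAmt"].clip(upper=cap)

print(df)
```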

## 6. Target Analysis

We analyze the fraud target to understand its overall rate and how it varies across transaction segments such as product type, device information, and time.

This helps determine class imbalance severity and guides modeling and evaluation strategy.

In [52]:
# Overall fraud rate
fraud_rate = train_tr['isFraud'].mean()
fraud_rate

np.float64(0.015)

In [53]:
# Fraud rate by product category
train_tr.groupby("ProductCD")['isFraud'].mean().sort_values(ascending=False)

ProductCD
S    0.115385
C    0.049689
R    0.034483
W    0.004373
H    0.000000
Name: isFraud, dtype: float64

In [None]:
# Compare fraud rates for desktop, mobile, and transactions without identity data
train_full.groupby("DeviceType", dropna=False)["isFraud"].mean()

DeviceType
desktop    0.024845
mobile     0.062500
NaN        0.004219
Name: isFraud, dtype: float64

**Conclusion**

The overall fraud rate is low (~1.5%, i.e. 15 of 1,000 sampled transactions), confirming strong class imbalance. Fraud rates vary across product category and device type, suggesting heterogeneous fraud patterns, although the segment-level rates rest on very few fraud cases and should be re-estimated on the full dataset. Device information appears to carry useful signal, but identity coverage is incomplete, so models must remain robust when identity data is missing.

These findings imply that modeling must be cost-sensitive, segmentation-aware, and evaluated using metrics aligned with business impact (e.g., expected loss or recall at controlled false-positive rates).
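As an illustration of "recall at a controlled false-positive rate", the sketch below picks the score threshold that allows at most a given share of legitimate transactions to be flagged and reports the recall obtained there. `recall_at_fpr` is a hypothetical helper and the labels and scores are toy values.

```python
import numpy as np

def recall_at_fpr(y_true, scores, max_fpr=0.01):
    """Return (recall, realized_fpr, threshold) with at most max_fpr of negatives flagged."""
    if not 0.0 <= max_fpr < 1.0:
        raise ValueError("max_fpr must be in [0, 1)")
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)

    neg_sorted = np.sort(scores[~y_true])[::-1]  # legitimate scores, highest first
    k = int(np.floor(max_fpr * len(neg_sorted)))  # number of false positives allowed
    threshold = neg_sorted[k]                     # flag only scores strictly above this

    flagged = scores > threshold
    recall = flagged[y_true].mean()
    realized_fpr = flagged[~y_true].mean()
    return float(recall), float(realized_fpr), float(threshold)

# Toy example: 10 legitimate and 3 fraudulent transactions.
y_true = [0] * 10 + [1] * 3
scores = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.6, 0.9, 0.95, 0.7, 0.3]

recall, fpr, threshold = recall_at_fpr(y_true, scores, max_fpr=0.1)
print(recall, fpr, threshold)
```

Here one false positive is allowed, the threshold lands at 0.6, and two of the three fraud cases are caught.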

## 7. Temporal Structure

We analyze the temporal structure of the dataset using `TransactionDT`, which represents a time delta from a reference point.

The goal is to understand whether fraud patterns change over time and whether time-based validation is required.

In [55]:
# Estimate fraud rate over time using quintiles of TransactionDT
# (ranking first keeps the bins equal-sized even when time values tie)
train_tr["dt_quintile"] = pd.qcut(
    train_tr["TransactionDT"].rank(method="first"),
    q=5,
    labels=False,
)

train_tr.groupby("dt_quintile")["isFraud"].mean()

dt_quintile
0    0.000
1    0.030
2    0.005
3    0.030
4    0.010
Name: isFraud, dtype: float64

In [42]:
train_tr['TransactionDT'].describe()

count      1000.00000
mean      96340.39500
std        7042.11914
min       86400.00000
25%       90330.75000
50%       94863.50000
75%      100976.00000
max      112151.00000
Name: TransactionDT, dtype: float64

**Conclusion**

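Fraud rates differ across the five time segments (0% to 3%), but each segment holds only 200 rows and the sample contains just 15 fraud cases over a narrow `TransactionDT` range (86,400 to 112,151). This is suggestive rather than conclusive evidence of temporal instability and should be re-checked on the full dataset. Even so, random train/test splits would mix earlier and later observations, which may introduce leakage and overestimate model performance.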

Model validation must therefore use time-based cross-validation, and the production system should include monitoring for performance drift over time.
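A minimal sketch of a time-based split, assuming an expanding-window scheme in which each fold validates on the block that follows its training data. `expanding_time_splits` is a hypothetical helper and the time values are toy data.

```python
import numpy as np
import pandas as pd

def expanding_time_splits(dt: pd.Series, n_folds: int = 3):
    """Yield (train_idx, valid_idx) positional indices.

    Rows are ordered by time and cut into n_folds + 1 consecutive blocks;
    fold i trains on blocks 0..i-1 and validates on block i, so no
    validation row is earlier than any training row.
    """
    order = np.argsort(dt.to_numpy(), kind="stable")
    blocks = np.array_split(order, n_folds + 1)
    for i in range(1, n_folds + 1):
        yield np.concatenate(blocks[:i]), blocks[i]

# Toy time deltas in arbitrary order (hypothetical values).
dt = pd.Series([50, 10, 40, 20, 30, 60, 80, 70])

splits = list(expanding_time_splits(dt, n_folds=3))
for train_idx, valid_idx in splits:
    print(sorted(dt.iloc[train_idx]), "->", sorted(dt.iloc[valid_idx]))
```

scikit-learn's `TimeSeriesSplit` offers a similar scheme for data that is already sorted by time; the hand-rolled version is shown only to make the ordering constraint explicit.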

## 8. Identity Coverage

We evaluate how many transactions contain identity information and whether fraud patterns differ between transactions with and without identity data.

This helps determine whether identity features are reliable signals and whether missing identity may introduce bias.

In [56]:
# Compare fraud rates for transactions with and without identity information
train_full['has_identity'] = train_full['DeviceType'].notnull()
train_full.groupby('has_identity')['isFraud'].mean()

has_identity
False    0.004219
True     0.041522
Name: isFraud, dtype: float64

**Conclusion**

Identity information is missing for about 71% of sampled transactions, and the fraud rate is roughly ten times higher when identity data is present (~4.2% vs ~0.4%). This suggests that identity availability is not random and may introduce selection bias.

Models must therefore perform well even when identity features are absent, using transaction-level features or explicit missing indicators. Identity features can improve detection but should not be relied upon as the primary signal.
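Because the sample is small, it helps to attach uncertainty to the two rates. The sketch below computes 95% Wilson score intervals; the counts 12/289 and 3/711 are inferred from the rates printed above (0.041522 and 0.004219) rather than read from the data, so treat them as an assumption.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 gives roughly 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# ASSUMPTION: counts implied by the sample output (12/289 and 3/711).
with_id = wilson_interval(12, 289)
without_id = wilson_interval(3, 711)

print("with identity:   ", with_id)
print("without identity:", without_id)
```

Under these assumed counts the two intervals do not overlap, which is consistent with the conclusion that the difference is unlikely to be sampling noise alone.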

## 9. Leakage Risk

We assess potential sources of data leakage that could lead to overly optimistic model performance during offline validation.

**Potential leakage sources**

• Temporal features that may encode information from future events  
• Engineered anonymized features (D*, V*) whose construction is unknown  
• Improper random train/test splits that mix different time periods  

**Mitigation strategies**

• Use time-based cross-validation aligned with TransactionDT  
• Perform feature ablation tests (e.g., train models without D* or V* features)  
• Validate models on a true future holdout dataset

**Conclusion**

Leakage is a significant risk in this dataset due to anonymized engineered features and temporal dependencies. Validation must therefore be strictly time-based, and model performance should be verified through feature ablation and future-period holdouts to ensure realistic results.
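The feature ablation idea can be sketched as building one feature list per removed family and scoring each list with the same time-based validation. `ablation_sets` is a hypothetical helper; the regular expressions assume the families are named `D<number>` and `V<number>`.

```python
import re

def ablation_sets(columns, patterns):
    """Return {name: remaining columns}, removing one named feature family at a time."""
    sets = {"all_features": list(columns)}
    for name, pattern in patterns.items():
        rx = re.compile(pattern)
        sets[f"without_{name}"] = [c for c in columns if not rx.fullmatch(c)]
    return sets

# A handful of column names for illustration.
columns = ["TransactionAmt", "card1", "D1", "D15", "V1", "V339", "DeviceType"]

# fullmatch on "D\d+" keeps DeviceType, which a plain startswith("D") would drop.
sets = ablation_sets(columns, {"D": r"D\d+", "V": r"V\d+"})
for name, cols in sets.items():
    print(name, cols)
```

A large performance drop when a family is removed is not proof of leakage on its own, but it flags where the model's signal is concentrated and where closer inspection is warranted.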

## 10. Dataset Limitations

The IEEE-CIS dataset has several important limitations:

• Features are anonymized, limiting interpretability and domain-specific feature engineering  
• Fraud labels may be delayed due to chargebacks or manual reviews  
• No merchant-level information is available  
• No graph or network data is provided  

**Impact**

Some fraud patterns, such as merchant-specific behavior or fraud rings, cannot be modeled. These limitations define the realistic scope of the prototype system and should be considered when interpreting model performance.

## 11. Implications for Architecture

The data analysis suggests the following requirements for a production-like antifraud system:

• Real-time feature availability is feasible for most transaction-level features  
• Missing identity information requires models that handle sparse inputs or fallback scoring paths  
• Temporal instability implies the need for continuous performance monitoring and retraining  
• Prediction logging is required for auditability, debugging, and drift detection  
• Strong class imbalance requires threshold optimization based on expected financial loss  

These findings connect the data understanding phase with the system architecture defined in `04_architecture.md`.
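One common way to monitor drift from logged scores or features is the population stability index (PSI). The sketch below is an assumption about how such monitoring could be done, not something defined in the architecture document; `population_stability_index` is a hypothetical helper and the data is synthetic.

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a reference window and a recent window of one numeric variable."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)

    # Inner bin edges come from quantiles of the reference window.
    inner_edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1)[1:-1])

    def bin_fractions(values):
        idx = np.searchsorted(inner_edges, values, side="right")
        counts = np.bincount(idx, minlength=n_bins)
        # Clip to avoid log(0) for empty bins.
        return np.clip(counts / len(values), eps, None)

    e_frac = bin_fractions(expected)
    a_frac = bin_fractions(actual)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

# Synthetic windows: one drawn from the same distribution, one shifted.
rng = np.random.default_rng(0)
reference = rng.normal(size=5000)
same = rng.normal(size=5000)
shifted = rng.normal(loc=1.0, size=5000)

psi_same = population_stability_index(reference, same)
psi_shifted = population_stability_index(reference, shifted)
print(psi_same, psi_shifted)
```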

## 12. Next Steps for Modeling

The following steps will guide the modeling phase:

• Build a logistic regression baseline to establish a reference model  
• Use time-based cross-validation aligned with TransactionDT  
• Define cost-based evaluation metrics and optimize decision thresholds (e.g., expected loss)  
• Evaluate models with and without identity features to assess robustness  
• Calibrate predicted probabilities to support reliable decision-making  

These steps will be formalized in `03_modeling_strategy.md`.
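As a sketch of cost-based threshold selection, the example below scans candidate thresholds and keeps the one with the lowest total loss. The cost model is an assumption: a missed fraud costs the full transaction amount and each false positive costs a fixed `fp_cost`. `best_threshold` is a hypothetical helper and all values are toy data.

```python
import numpy as np

def best_threshold(y_true, proba, amounts, fp_cost=5.0, thresholds=None):
    """Return (threshold, loss) minimizing missed-fraud amounts plus false-positive costs."""
    y_true = np.asarray(y_true).astype(bool)
    proba = np.asarray(proba, dtype=float)
    amounts = np.asarray(amounts, dtype=float)
    if thresholds is None:
        thresholds = np.linspace(0.0, 1.0, 101)

    losses = []
    for t in thresholds:
        flagged = proba >= t
        missed_fraud = amounts[y_true & ~flagged].sum()      # fraud that was let through
        false_alarms = (~y_true & flagged).sum() * fp_cost   # legitimate transactions blocked
        losses.append(missed_fraud + false_alarms)

    i = int(np.argmin(losses))  # first threshold reaching the minimum loss
    return float(thresholds[i]), float(losses[i])

# Toy labels, predicted probabilities, and transaction amounts (hypothetical).
y_true = [0, 0, 0, 1, 1]
proba = [0.1, 0.4, 0.8, 0.35, 0.9]
amounts = [20.0, 30.0, 40.0, 100.0, 200.0]

threshold, loss = best_threshold(y_true, proba, amounts, fp_cost=5.0)
print(threshold, loss)
```

The chosen threshold is only meaningful if the probabilities are calibrated and the cost assumptions reflect the business, so both should be revisited when the modeling strategy is formalized.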