# Detecting Data Quality Issues in Credit Card Fraud Detection Dataset

This example uses **Datalab** to automatically detect various issues in a tabular dataset of the kind commonly encountered in financial applications. Specifically, we use the [Credit Card Fraud Detection dataset](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud). Download it first and make sure the data file is named **credit_card_fraud_dataset.csv**. This dataset contains 100,000 transaction records labeled as fraudulent or non-fraudulent, along with features about each transaction such as the transaction amount plus other variables anonymized for privacy.

### Cleanlab Helps Uncover:
- **Label errors**: Mislabeled transactions, such as fraudulent cases incorrectly marked as non-fraudulent.
- **Outliers**: Transactions with abnormal patterns that deviate significantly from the rest of the dataset.
- **Near-duplicates**: Repeated transactions or entries that may distort results or impact model performance.

Using Cleanlab, we automatically identify examples that are likely mislabeled or otherwise problematic, improving the overall data quality for better fraud detection performance. You can adapt this tutorial to detect and correct issues in your own financial tabular datasets.

## 1. Install and Import Dependencies

In [None]:
!pip install "cleanlab[all]"

In [1]:
import random
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_predict
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors
from cleanlab import Datalab

# Optional: set seed for reproducibility
SEED = 42
np.random.seed(SEED)
random.seed(SEED)

## 2. Load and Process the Data

In [3]:
fraud_data = pd.read_csv("credit_card_fraud_dataset.csv")
fraud_data.head()

Unnamed: 0,TransactionID,TransactionDate,Amount,MerchantID,TransactionType,Location,IsFraud
0,1,2024-04-03 14:15:35.462794,4189.27,688,refund,San Antonio,0
1,2,2024-03-19 13:20:35.462824,2659.71,109,refund,Dallas,0
2,3,2024-01-08 10:08:35.462834,784.0,394,purchase,New York,0
3,4,2024-04-13 23:50:35.462850,3514.4,944,purchase,Philadelphia,0
4,5,2024-07-12 18:51:35.462858,369.07,475,purchase,Phoenix,0


In [4]:
# Select relevant features and labels
X_raw = fraud_data[["Amount", "TransactionType", "Location"]]
y = fraud_data["IsFraud"]

We will now preprocess the dataset to prepare it for analysis. This involves:
1. Selecting relevant features (e.g., `Amount`, `TransactionType`, `Location`).
2. Encoding categorical variables (e.g., `TransactionType` and `Location`) using one-hot encoding.
3. Standardizing numerical variables (e.g., `Amount`) to ensure all features are on a similar scale.

Next, we assign the preprocessed features to `X_encoded`; the labels (`IsFraud`) are already stored in `y`.

In [5]:
# One-hot encode categorical features
categorical_features = ["TransactionType", "Location"]
X_encoded = pd.get_dummies(X_raw, columns=categorical_features, drop_first=True)

# Standardize numerical features
numeric_features = ["Amount"]
scaler = StandardScaler()
X_encoded[numeric_features] = scaler.fit_transform(X_encoded[numeric_features])

# Display preprocessed data
print(X_encoded.head())

     Amount  TransactionType_refund  Location_Dallas  Location_Houston  \
0  1.173161                    True            False             False   
1  0.112740                    True             True             False   
2 -1.187661                   False            False             False   
3  0.705284                   False            False             False   
4 -1.475326                   False            False             False   

   Location_Los Angeles  Location_New York  Location_Philadelphia  \
0                 False              False                  False   
1                 False              False                  False   
2                 False               True                  False   
3                 False              False                   True   
4                 False              False                  False   

   Location_Phoenix  Location_San Antonio  Location_San Diego  \
0             False                  True               False   
1         

## 3. Train a Classification Model and Compute Out-of-Sample Predicted Probabilities

To detect potential label errors in the **Credit Card Fraud Detection dataset**, Cleanlab requires **probabilistic predictions** for every data point. However, predictions generated on the same data used for training can be **overfitted** and unreliable. For accurate results, Cleanlab works best with **out-of-sample** predicted class probabilities—i.e., predictions for data points excluded from the model during training.

#### How We Generate Out-of-Sample Predictions

We use **K-fold cross-validation**, which:
1. Splits the dataset into `K` folds.
2. Trains the model on `K-1` folds and predicts probabilities on the excluded fold.
3. Repeats this for all folds so that every data point gets a prediction from a model that has not seen it during training.

This ensures every data point has **out-of-sample predicted probabilities**.
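The steps above can be written out by hand to make the mechanics visible. The sketch below is illustrative only: it uses synthetic stand-in data from `make_classification` (an assumption, not the fraud dataset) and an explicit `StratifiedKFold` loop, where every variable name is made up for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data (an assumption; not the fraud dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

n_classes = len(np.unique(y_demo))
oos_pred_probs = np.zeros((len(y_demo), n_classes))

# Each example lands in exactly one held-out fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, holdout_idx in skf.split(X_demo, y_demo):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_demo[train_idx], y_demo[train_idx])  # train on K-1 folds
    # Predict only on the fold this model never saw
    oos_pred_probs[holdout_idx] = model.predict_proba(X_demo[holdout_idx])

print(oos_pred_probs.shape)
```

In practice, `cross_val_predict` performs this loop for you, so the manual version is mainly useful when you need custom behavior inside each fold.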

#### Model: Logistic Regression

For this example, we use **Logistic Regression**, a simple and interpretable model commonly used in fraud detection tasks. It predicts the probability of each class (`0` for non-fraud, `1` for fraud) based on the input features of an example in the dataset. The same approach will work with *any* Machine Learning model.

In [6]:
clf = LogisticRegression(max_iter=1000, random_state=SEED)

In [7]:
num_crossval_folds = 5
pred_probs = cross_val_predict(
    clf,
    X_encoded,     # Preprocessed feature matrix
    y,             # Labels
    cv=num_crossval_folds,
    method="predict_proba"  # Get predicted probabilities
)

print("Shape of predicted probabilities:", pred_probs.shape)

Shape of predicted probabilities: (100000, 2)


## 4. Construct a K Nearest Neighbors Graph (Optional)

The **KNN graph** represents the similarity between examples in the dataset.
Here, we'll define similarity using the **Euclidean distance** between our normalized feature values.

Note that this step is *optional*. If you pass in numerical `features` to Datalab but no KNN graph, then Datalab will internally construct its own KNN graph.
You can provide your own KNN graph, as done here, to exert greater control over this process. This is also useful when your data aren't in a numerical format or when you have a massive dataset (use [approximate KNN](https://docs.cleanlab.ai/stable/tutorials/datalab/workflows.html#Accelerate-Issue-Checks-with-Pre-computed-kNN-Graphs) in that case).

Here we use scikit-learn's `NearestNeighbors` class to construct this graph:
1. Find the nearest neighbors of each example (scikit-learn's default is 5 neighbors) and the distances to them.
2. Represent the graph as a sparse matrix, with nonzero entries indicating the distance to nearest neighbors.
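To see what such a sparse graph looks like, here is a small sketch on four hypothetical 2-D points (toy values chosen for illustration, with `n_neighbors=2` instead of the default). Calling `kneighbors_graph` without passing data queries the fitted points themselves and does not count a point as its own neighbor.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Four toy points forming two tight pairs (hypothetical values)
points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])

toy_knn = NearestNeighbors(n_neighbors=2, metric="euclidean").fit(points)

# Row i holds the distances from point i to its 2 nearest neighbors;
# all other entries are left empty (implicitly zero) in the sparse matrix
toy_graph = toy_knn.kneighbors_graph(mode="distance")

print(toy_graph.shape)  # one row and one column per point
print(toy_graph.nnz)    # number of stored neighbor distances
print(toy_graph.toarray().round(2))
```

Each row stores exactly `n_neighbors` distances, so the graph stays sparse even for large datasets.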

In [8]:
# Create a KNN model with Euclidean distance as the metric
knn = NearestNeighbors(metric="euclidean")

# Fit the KNN model to the preprocessed feature values
knn.fit(X_encoded.values)

# Construct the KNN graph as a sparse matrix
knn_graph = knn.kneighbors_graph(mode="distance")

## 5. Use Datalab to Find Dataset Issues

With the given labels, predicted probabilities, and the KNN graph (optional), Datalab can automatically identify various issues in the dataset.

In [9]:
# Wrap the dataset into a dictionary
data = {"X": X_encoded.values, "y": y}

# Create a Datalab object
lab = Datalab(data, label_name="y")

# Use Cleanlab to find issues in the dataset
lab.find_issues(pred_probs=pred_probs, knn_graph=knn_graph)  # could provide features here instead of knn_graph

Finding label issues ...
Finding outlier issues ...
Finding near_duplicate issues ...
Finding non_iid issues ...
Finding class_imbalance issues ...
Finding underperforming_group issues ...

Audit complete. 12043 issues found in the dataset.


In [10]:
lab.report()

Dataset Information: num_examples: 100000, num_classes: 2

Here is a summary of various issues found in your data:

     issue_type  num_issues
 near_duplicate        8639
        outlier        1797
class_imbalance        1000
          label         607

Learn about each issue: https://docs.cleanlab.ai/stable/cleanlab/datalab/guide/issue_type_description.html
See which examples in your dataset exhibit each issue via: `datalab.get_issues(<ISSUE_NAME>)`

Data indices corresponding to top examples of each issue are shown below.


------------------ near_duplicate issues -------------------

About this issue:
	A (near) duplicate issue refers to two or more examples in
    a dataset that are extremely similar to each other, relative
    to the rest of the dataset.  The examples flagged with this issue
    may be exactly duplicated, or lie atypically close together when
    represented as vectors (i.e. feature embeddings).
    

Number of examples with this issue: 8639
Overall dataset qual

### Label Issues
The report indicates that Cleanlab identified several label issues in the dataset. These are data entries where the given labels may not match the actual label, as estimated by Cleanlab. Each issue includes a numeric label score that quantifies how likely the label is correct (lower scores indicate higher likelihood of being mislabeled).
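For intuition about how such a score can arise, the sketch below computes a "self-confidence" score: the out-of-sample predicted probability assigned to each example's given label. This is a simplified illustration on hypothetical numbers, and we are assuming it resembles the scoring Datalab applies; the exact method Datalab uses may differ.

```python
import numpy as np

# Hypothetical given labels and out-of-sample predicted probabilities
given_labels = np.array([0, 0, 1, 1])
demo_pred_probs = np.array([
    [0.99, 0.01],
    [0.60, 0.40],
    [0.98, 0.02],  # labeled 1, but the model strongly favors class 0
    [0.10, 0.90],
])

# Self-confidence: probability the model assigns to the given label
self_confidence = demo_pred_probs[np.arange(len(given_labels)), given_labels]
predicted_labels = demo_pred_probs.argmax(axis=1)

print(self_confidence)
print(predicted_labels)
print("Most suspicious example:", self_confidence.argmin())
```

The third example receives the lowest score because the model gives its given label only a 0.02 probability, which mirrors the pattern of low `label_score` values for flagged rows.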

In [11]:
# Retrieve label issues
label_issues = lab.get_issues("label")

print(label_issues.head())

   is_label_issue  label_score  given_label  predicted_label
0           False     0.990469            0                0
1           False     0.991203            0                0
2           False     0.988302            0                0
3           False     0.990321            0                0
4           False     0.991149            0                0


In [12]:
# Filter rows with label issues
label_issues_filtered = label_issues[label_issues["is_label_issue"]]
print(label_issues_filtered.head())

     is_label_issue  label_score  given_label  predicted_label
190            True     0.007187            1                0
191            True     0.007622            1                0
208            True     0.007177            1                0
319            True     0.008984            1                0
506            True     0.009220            1                0


In [13]:
# Sort the label issues by label_score (lower scores indicate higher likelihood of being mislabeled)
sorted_issues = label_issues.sort_values("label_score").index

# View the most likely label errors
X_raw.iloc[sorted_issues].assign(
    given_label=y.iloc[sorted_issues],
    predicted_label=label_issues["predicted_label"].iloc[sorted_issues]
).head()


Unnamed: 0,Amount,TransactionType,Location,given_label,predicted_label
6901,346.13,purchase,San Jose,1,0
7933,25.91,refund,San Jose,1,0
13204,963.84,purchase,San Jose,1,0
16276,1093.22,purchase,San Jose,1,0
7546,598.78,refund,San Jose,1,0


### Example Review of Label Issues

The dataframe below shows the original label (`given_label`) for examples that Cleanlab finds most likely to be mislabeled, as well as an alternative `predicted_label` for each example.

| Amount  | TransactionType | Location  | given_label | predicted_label |
|---------|------------------|-----------|-------------|-----------------|
| 346.13  | purchase         | San Jose  | 1           | 0               |
| 25.91   | refund           | San Jose  | 1           | 0               |
| 963.84  | purchase         | San Jose  | 1           | 0               |
| 1093.22 | purchase         | San Jose  | 1           | 0               |
| 598.78  | refund           | San Jose  | 1           | 0               |

These examples may have been labeled incorrectly and should be carefully re-examined:
- **Entry 1**: A purchase of `$346.13` labeled as fraudulent (`1`) is predicted to be non-fraudulent (`0`).
- **Entry 2**: A refund of `$25.91` is similarly labeled as fraudulent but predicted as non-fraudulent.
- **Entry 4**: A purchase of `$1093.22` is also labeled as fraudulent but predicted as non-fraudulent.

All five of the top-ranked examples are transactions in `San Jose` labeled as fraudulent, which suggests a potential mislabeling pattern for that location. Refunds and lower-amount transactions in particular might have been mislabeled as fraudulent. These flags are only as reliable as the model's predictions, so review them with additional domain knowledge or transaction metadata before changing any labels.

Such insights are crucial for improving the dataset's quality and ensuring the model learns from accurate labels.



### Outlier Issues

According to the report, our dataset contains some outliers. We can see which examples are outliers (and a numeric quality score quantifying how typical each example appears to be) via the `get_issues` method. We sort the resulting DataFrame by Cleanlab’s outlier quality score to see the most severe outliers in our dataset.

In [14]:
outlier_results = lab.get_issues("outlier")
sorted_outliers = outlier_results.sort_values("outlier_score").index

X_raw.iloc[sorted_outliers].head()

Unnamed: 0,Amount,TransactionType,Location
43484,4999.73,purchase,Chicago
4659,2114.37,refund,Philadelphia
67602,3255.47,purchase,San Jose
91994,1147.93,refund,Chicago
52696,4005.05,purchase,San Antonio






#### **Key Observations**:
1. **Entry 1**: A purchase transaction with an unusually high amount of `$4999.73` in Chicago may represent a legitimate but rare high-value transaction or could be indicative of an error.
2. **Entry 2**: A refund for `$2114.37` in Philadelphia seems unusually high compared to typical refund amounts and should be verified.
3. **Entry 5**: Another high-value purchase transaction of `$4005.05` in San Antonio is rare and should be reviewed for validity.

#### **Next Steps**:
- **Investigate Outliers**:
  - Validate whether these transactions are legitimate or the result of data errors.
  - Cross-check these entries against metadata such as timestamps, merchants, and customer profiles for better context.
- **Handle Outliers**:
  - **Retain**: If the transaction is valid, keep it in the dataset for training.
  - **Remove**: If the transaction is deemed erroneous or unrepresentative, exclude it from the dataset to avoid skewing the model's learning.

These steps help ensure that the dataset is representative and does not include erroneous entries that could degrade the performance of fraud detection models.
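The "Remove" option can be sketched as follows. This is a minimal illustration on hypothetical toy data: `toy_data` and `toy_outliers` are made up, and the `is_outlier_issue` column name is assumed to follow the same naming pattern as `is_label_issue` seen earlier.

```python
import pandas as pd

# Hypothetical transactions and a stand-in for lab.get_issues("outlier")
toy_data = pd.DataFrame({"Amount": [12.50, 40.00, 4999.73, 35.20]})
toy_outliers = pd.DataFrame({
    "is_outlier_issue": [False, False, True, False],  # assumed column name
    "outlier_score": [0.81, 0.77, 0.03, 0.79],
})

# Indices flagged as outliers; in practice, drop only those confirmed
# to be erroneous after manual review
flagged_idx = toy_outliers.index[toy_outliers["is_outlier_issue"]]
cleaned = toy_data.drop(index=flagged_idx)

print(cleaned)
```

Dropping every flagged row automatically is rarely appropriate for fraud data, since rare but legitimate transactions are often the ones a model most needs to see.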
  

### Near-Duplicate Issues

According to the report, our dataset contains some sets of nearly duplicated examples. We can see which examples are (nearly) duplicated (and a numeric quality score quantifying how dissimilar each example is from its nearest neighbor in the dataset) via `get_issues`. We sort the resulting DataFrame by Cleanlab’s near-duplicate quality score to see the examples in our dataset that are most nearly duplicated (the score is equal to 0 if examples are exactly duplicated).

In [15]:
duplicate_results = lab.get_issues("near_duplicate")
duplicate_results.sort_values("near_duplicate_score").head()

Unnamed: 0,is_near_duplicate_issue,near_duplicate_score,near_duplicate_sets,distance_to_nearest_neighbor
62583,True,0.0,[55080],0.0
30333,True,0.0,[13617],0.0
12827,True,0.0,[15703],0.0
66741,True,0.0,[82920],0.0
45125,True,0.0,[95476],0.0


The results above show which examples Cleanlab considers nearly duplicated (rows where `is_near_duplicate_issue == True`). Let’s view some of these examples to see how similar they are.

In [18]:
# Identify the row with the lowest near_duplicate_score
lowest_scoring_duplicate = duplicate_results["near_duplicate_score"].idxmin()

# Extract the indices of the lowest scoring duplicate and its near duplicate sets
indices_to_display = [lowest_scoring_duplicate] + duplicate_results.loc[lowest_scoring_duplicate, "near_duplicate_sets"].tolist()

# Display the relevant rows from the original dataset
X_raw.iloc[indices_to_display]


Unnamed: 0,Amount,TransactionType,Location
73,3374.61,refund,New York
19427,3374.61,refund,New York
30450,3374.63,refund,New York


The first two rows are exact duplicates in the three features we selected, and the third differs only by `$0.02` in `Amount`. Perhaps the same information was accidentally recorded multiple times in this data.

Similarly, let’s take a look at another example and the identified near-duplicate sets:

In [19]:
# Identify the next row not in the previous near duplicate set
second_lowest_scoring_duplicate = duplicate_results["near_duplicate_score"].drop(indices_to_display).idxmin()

# Extract the indices of the second lowest scoring duplicate and its near duplicate sets
next_indices_to_display = [second_lowest_scoring_duplicate] + duplicate_results.loc[second_lowest_scoring_duplicate, "near_duplicate_sets"].tolist()

# Display the relevant rows from the original dataset
X_raw.iloc[next_indices_to_display]

Unnamed: 0,Amount,TransactionType,Location
167,1796.39,refund,New York
53564,1796.39,refund,New York


We identified another set of exact duplicates in our dataset! Including near/exact duplicates in a dataset may have unintended effects on models; be wary about splitting them across training/test sets. Learn more about handling near duplicates detected in a dataset from the [FAQ](https://docs.cleanlab.ai/stable/tutorials/faq.html#How-to-handle-near-duplicate-data-identified-by-Datalab?).
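One way to avoid splitting duplicates across training and test sets is to assign a shared group ID to duplicated rows and split by group. The sketch below is an assumption-laden illustration rather than the approach from the FAQ: it uses hypothetical toy rows, groups only *exact* duplicates via `groupby(...).ngroup()`, and relies on scikit-learn's `GroupShuffleSplit`.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical rows containing two pairs of exact duplicates
toy = pd.DataFrame({
    "Amount": [3374.61, 3374.61, 1796.39, 1796.39, 20.00, 55.10, 784.00, 369.07],
    "TransactionType": ["refund"] * 4 + ["purchase"] * 4,
    "Location": ["New York"] * 4 + ["Dallas", "Phoenix", "New York", "Phoenix"],
})

# Rows with identical feature values receive the same group ID
group_ids = toy.groupby(["Amount", "TransactionType", "Location"], sort=False).ngroup()

# Split by group so no duplicate set straddles train and test
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(toy, groups=group_ids))

train_groups = set(group_ids.iloc[train_idx])
test_groups = set(group_ids.iloc[test_idx])
print("Groups shared between train and test:", train_groups & test_groups)
```

To cover near duplicates as well, the group IDs could instead be derived from the `near_duplicate_sets` column, merging each example with the members of its set.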

This tutorial highlights a straightforward approach to detect potentially incorrect information in any tabular dataset. Just use Cleanlab with any ML model – the better the model, the more accurate the data issues detected by Cleanlab will be!