# Transaction Anomaly Detection
Transaction anomaly detection with unsupervised learning identifies unusual or suspicious patterns in financial transactions without requiring labeled training data. Because anomalies are rare and unpredictable, unsupervised methods are particularly useful when transactions are not already labeled as normal or fraudulent.
## Importing Required Libraries

In [3]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

## 1. Data Exploration & Preprocessing
### 1.1. Examine the data

In [4]:
# Load the dataset
df = pd.read_csv('transactions_data.csv')

# Check the shape of the dataset (rows, columns)
print("Shape of the dataset:", df.shape)

# Get dataset information (column names and data types; for a frame this large,
# pandas omits non-null counts unless show_counts=True is passed)
print("Dataset Info:\n")
df.info()

# List all columns (features)
print("Columns in the dataset:", df.columns)

# Check data types of each column
print("Data types:\n", df.dtypes)

# Display first 5 rows
print("\nFirst 5 rows:\n", df.head())

Shape of the dataset: (13305915, 12)
Dataset Info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13305915 entries, 0 to 13305914
Data columns (total 12 columns):
 #   Column          Dtype  
---  ------          -----  
 0   id              int64  
 1   date            object 
 2   client_id       int64  
 3   card_id         int64  
 4   amount          object 
 5   use_chip        object 
 6   merchant_id     int64  
 7   merchant_city   object 
 8   merchant_state  object 
 9   zip             float64
 10  mcc             int64  
 11  errors          object 
dtypes: float64(1), int64(5), object(6)
memory usage: 1.2+ GB
Columns in the dataset: Index(['id', 'date', 'client_id', 'card_id', 'amount', 'use_chip',
       'merchant_id', 'merchant_city', 'merchant_state', 'zip', 'mcc',
       'errors'],
      dtype='object')
Data types:
 id                  int64
date               object
client_id           int64
card_id             int64
amount             object
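use_chip           object
merchant_id         int64
merchant_city      object
merchant_state     object
zip               float64
mcc                 int64
errors             object
dtype: object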

The dataset has 12 columns:
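- `id` & `client_id`: Identifiers for the transaction and the client; likely not useful as direct inputs for anomaly detection, though `client_id` can be used to group transactions per client.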
- `date`: Transaction timestamps, useful for detecting time-based anomalies.
- `card_id`: Helps track spending patterns per card.
- `amount`: Critical feature; unusual transaction amounts may indicate fraud.
- `use_chip`: Records how the card was used (swipe, chip, or online transaction), which can be relevant for security-based anomalies (e.g., swiping a card in locations that require chips).
- `merchant_id` & `merchant_city/state/zip`: Help identify unusual locations for spending behavior.
- `mcc` (Merchant Category Code): Useful for detecting spending in unexpected categories.
- `errors`: If this captures transactional errors, it could be a strong anomaly indicator.
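
To use `date` for the time-based anomalies mentioned above, the string timestamps first need to be parsed. The following is a minimal sketch on a small hypothetical sample (not the real data); the derived column names `hour`, `day_of_week`, and `is_night`, as well as the 00:00-05:59 "night" window, are illustrative assumptions rather than part of the original analysis.

```python
import pandas as pd

# Hypothetical sample rows in the same format as the `date` column
sample = pd.DataFrame({
    "card_id": [1, 1, 2],
    "date": ["2010-01-01 00:09:00", "2010-01-01 14:30:00", "2010-01-02 03:15:00"],
})

# Parse the string timestamps into datetime64 values
sample["date"] = pd.to_datetime(sample["date"], format="%Y-%m-%d %H:%M:%S")

# Derive simple time features
sample["hour"] = sample["date"].dt.hour
sample["day_of_week"] = sample["date"].dt.dayofweek  # Monday=0 ... Sunday=6
sample["is_night"] = sample["hour"].between(0, 5)    # assumed "night" window

print(sample)
```

Passing an explicit `format` avoids per-row format inference, which may matter on a frame with more than 13 million rows.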

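### 1.2. Data Cleaning
#### 1.2.1. Summarize Numerical Features
Before computing descriptive statistics for the numerical features, we count the missing values in each column. Note that `amount` is stored as `object` (a string), so it must be converted before it can be summarized numerically.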

In [7]:
# Total missing values per column
print("Missing values per column:\n", df.isnull().sum())

Missing values per column:
 id                       0
date                     0
client_id                0
card_id                  0
amount                   0
use_chip                 0
merchant_id              0
merchant_city            0
merchant_state     1563700
zip                1652706
mcc                      0
errors            13094522
dtype: int64
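
Two patterns stand out in these counts. `errors` is missing in 13,094,522 of 13,305,915 rows (about 98%), and the 1,563,700 missing `merchant_state` values match the number of `ONLINE` entries reported for `merchant_city` later in this section, which suggests that online transactions simply have no state. The sketch below shows one way to check and encode this on a small hypothetical sample; the sample rows, the error label, and the `has_error` column are assumptions for illustration, as is the reading that a missing `errors` value means "no error".

```python
import numpy as np
import pandas as pd

# Hypothetical sample with the same missing-value pattern as the real data
sample = pd.DataFrame({
    "merchant_city": ["ONLINE", "Houston", "ONLINE", "Miami"],
    "merchant_state": [np.nan, "TX", np.nan, "FL"],
    "zip": [np.nan, 77002.0, np.nan, 33101.0],
    "errors": [np.nan, np.nan, "Insufficient Balance", np.nan],  # made-up label
})

# Share of missing values per column (0.0 to 1.0)
missing_share = sample.isnull().mean()
print(missing_share)

# Check whether every row with a missing state is an online transaction
state_missing = sample["merchant_state"].isnull()
all_online = (sample.loc[state_missing, "merchant_city"] == "ONLINE").all()
print("Missing states are all ONLINE:", all_online)

# Assumption: a missing `errors` value means the transaction had no error
sample["has_error"] = sample["errors"].notnull()
print(sample[["errors", "has_error"]])
```

Encoding `errors` as a boolean flag keeps the information without having to impute a placeholder string for the large majority of rows.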


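#### 1.2.2. Summarize Categorical Features
For categorical features, we check the unique values and their frequencies. Because `date` and `amount` are stored as `object`, they are included in this summary as well.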

In [8]:
# Summary of categorical columns
categorical_columns = df.select_dtypes(include=['object']).columns
for col in categorical_columns:
    print(f"Unique values in {col}:\n", df[col].value_counts())
    print("\n")

Unique values in date:
 date
2016-03-03 11:42:00    18
2011-06-09 12:46:00    18
2018-08-18 11:43:00    17
2018-07-18 09:38:00    17
2018-11-23 07:16:00    17
                       ..
2010-01-01 00:30:00     1
2010-01-01 00:27:00     1
2010-01-01 00:26:00     1
2010-01-01 00:14:00     1
2010-01-01 00:09:00     1
Name: count, Length: 4136496, dtype: int64


Unique values in amount:
 amount
$80.00      132115
$100.00     128867
$60.00      101821
$120.00      81083
$40.00       56856
             ...  
$389.29          1
$1086.70         1
$924.98          1
$644.69          1
$452.77          1
Name: count, Length: 81161, dtype: int64
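
The `amount` counts above show that the values are strings with a leading `$`, which is why the column has dtype `object`. A minimal sketch of converting them to floats follows, using a few of the values shown above as a hypothetical sample; the `amount_num` column name is an assumption, and stripping commas is a precaution in case thousands separators occur in the full data.

```python
import pandas as pd

# Hypothetical sample in the same string format as the `amount` column
sample = pd.DataFrame({"amount": ["$80.00", "$100.00", "$1086.70"]})

# Strip the currency symbol (and any thousands separators), then convert to float
sample["amount_num"] = (
    sample["amount"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)

# Descriptive statistics are now available for the numeric column
print(sample["amount_num"].describe())
```

Using `regex=False` matters here because `$` is a regular-expression anchor and would otherwise not be treated as a literal character.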


Unique values in use_chip:
 use_chip
Swipe Transaction     6967185
Chip Transaction      4780818
Online Transaction    1557912
Name: count, dtype: int64


Unique values in merchant_city:
 merchant_city
ONLINE           1563700
Houston           146917
Miami              87388
Brooklyn           84020
Los Angeles        82004
                  ...   
Bur