# Transaction Data Generator
This notebook uses CTGAN (Conditional Tabular Generative Adversarial Network) to generate simulated transaction data for use in a transaction classifier.

While GANs have become popular for image generation, they struggle with tabular data, whose mix of categorical and numeric columns is poorly suited to a standard GAN architecture. CTGAN addresses this by introducing Mode-Specific Normalization to better model numeric features whose distributions are not well represented by a Gaussian-like distribution (as pixels in an image tend to be). CTGAN also introduces the Conditional Generator to overcome the sparsity of one-hot-encoded vectors and the class imbalance in categorical features that are typical of real-world data.
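To make Mode-Specific Normalization more concrete, here is a minimal sketch of the idea for a single numeric value. It is not the CTGAN implementation: the mixture parameters in `MODES` are hard-coded, hypothetical values (CTGAN learns them by fitting a Gaussian mixture to the column), and the sketch picks the most likely mode, whereas CTGAN samples the mode in proportion to its density. The function names are illustrative only.

```python
import math

# Hypothetical mixture for an `amount` column: (weight, mean, std) per mode.
# These values are assumptions chosen to keep the example self-contained.
MODES = [
    (0.7, 20.0, 10.0),     # small everyday purchases
    (0.3, 1500.0, 300.0),  # large payments such as rent
]


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def mode_specific_normalize(x, modes=MODES):
    """Return (one-hot mode indicator, scalar offset within that mode)."""
    densities = [w * gaussian_pdf(x, m, s) for w, m, s in modes]
    # Simplification: take the most likely mode instead of sampling one.
    k = max(range(len(modes)), key=lambda i: densities[i])
    _, mean, std = modes[k]
    # Scale by 4 standard deviations so most offsets fall in [-1, 1].
    alpha = (x - mean) / (4.0 * std)
    one_hot = [1 if i == k else 0 for i in range(len(modes))]
    return one_hot, alpha


def denormalize(one_hot, alpha, modes=MODES):
    """Invert mode_specific_normalize."""
    k = one_hot.index(1)
    _, mean, std = modes[k]
    return alpha * 4.0 * std + mean


for amount in (25.0, 1800.0):
    one_hot, alpha = mode_specific_normalize(amount)
    print(amount, one_hot, alpha, denormalize(one_hot, alpha))
```

Each value is thus represented by which mode it belongs to plus a small offset within that mode, which is easier for a network to model than a raw value drawn from a multimodal distribution.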

This notebook leverages [The Synthetic Data Vault Project's CTGAN implementation](https://github.com/sdv-dev/CTGAN) to generate synthetic data.

In [None]:
!pip install sdv -q

In [None]:
import os
import pandas as pd
import sdv
from sdv.metadata import Metadata
from sdv.single_table import CTGANSynthesizer
from rdt.transformers import UniformEncoder
from sklearn.utils import shuffle

print(sdv.version.public)

1.20.0


Upload a .csv transactions file to the session storage and update `FNAME` below. The dataframe should follow the Plaid transaction data model (see below).

The dataframe should have the columns `['id', 'amount', 'date', 'isRecurring',
'createdAt', 'updatedAt', 'merchant_name', 'cat_label', 'cat_group']`.
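A quick check for the expected columns before training can catch a malformed upload early. The sketch below is one possible way to do it; `missing_columns` is a hypothetical helper, not part of pandas or SDV, and it works on any iterable of column names.

```python
# Column names taken from the list above.
REQUIRED_COLUMNS = [
    'id', 'amount', 'date', 'isRecurring', 'createdAt',
    'updatedAt', 'merchant_name', 'cat_label', 'cat_group',
]


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names absent from `columns`, in order."""
    present = set(columns)
    return [name for name in required if name not in present]


# With a DataFrame this would be called as missing_columns(df.columns).
example = ['id', 'amount', 'date', 'merchant_name', 'cat_label']
print(missing_columns(example))
# → ['isRecurring', 'createdAt', 'updatedAt', 'cat_group']
```

An empty list means every expected column is present.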

In [None]:
FNAME = 'raw-transactions.csv'

In [None]:
# Read data
df = pd.read_csv(FNAME)

# Split into credits (amount >= 0) and debits (amount < 0)
filt = df['amount'] >= 0
df_credit = df.loc[filt]
df_debit = df.loc[~filt]


In [None]:
metadata = Metadata.detect_from_dataframe(
    data=df,
    table_name='transactions')

In [None]:
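def synthesize_data(df, metadata):
  """Fit a CTGAN synthesizer on df and return the fitted synthesizer."""
  synthesizer = CTGANSynthesizer(
      metadata,
      enforce_rounding=True,
      epochs=300,
      verbose=True,
      cuda=True
  )
  # Start from the default transformers SDV assigns for this data
  synthesizer.auto_assign_transformers(df)

  # Override the defaults for the two categorical text columns
  synthesizer.update_transformers(
    column_name_to_transformer={
        'merchant_name': UniformEncoder(),
        'cat_label': UniformEncoder()
    }
  )

  # Train the GAN on the real data
  synthesizer.fit(df)

  return synthesizer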


In [None]:
synthesize_credits = synthesize_data(df_credit, metadata)
synthesize_debits = synthesize_data(df_debit, metadata)


We strongly recommend saving the metadata using 'save_to_json' for replicability in future SDV versions.


Replacing the default transformer for column 'merchant_name' might impact the quality of your synthetic data.


Replacing the default transformer for column 'cat_label' might impact the quality of your synthetic data.

Gen. (-0.86) | Discrim. (-0.10): 100%|██████████| 300/300 [00:16<00:00, 18.21it/s]

Replacing the default transformer for column 'merchant_name' might impact the quality of your synthetic data.


Replacing the default transformer for column 'cat_label' might impact the quality of your synthetic data.

Gen. (-0.59) | Discrim. (-0.09): 100%|██████████| 300/300 [01:55<00:00,  2.60it/s]


In [None]:
fig = synthesize_debits.get_loss_values_plot()
fig.show()

In [None]:
debit_ratio = len(df_debit) / len(df)
credit_ratio = 1 - debit_ratio
n = 100000
num_debits = round(n * debit_ratio)
num_credits = round(n * credit_ratio)
synthetic_debits = synthesize_debits.sample(num_rows=num_debits)
synthetic_credits = synthesize_credits.sample(num_rows=num_credits)
synthetic_data = pd.concat([synthetic_debits, synthetic_credits])
synthetic_data = shuffle(synthetic_data)
synthetic_data.head()

| | id | amount | date | isRecurring | createdAt | updatedAt | merchant_name | cat_label |
|---|---|---|---|---|---|---|---|---|
| 41114 | 182726263294695936 | -21.443589 | 2024-05-16 | True | 2024-08-01T08:07:46.240244 | 2024-04-16T14:08:40.604315 | Amazon | guilt_free |
| 36 | 164069209552231200 | 124.348645 | 2018-11-08 | False | 2023-12-16T22:50:01.739708 | 2023-12-16T23:29:35.544304 | C34030 ENVIRONME DIR DEP 210714 | bills_utilities |
| 13066 | 164069205392529440 | -26.686479 | 2023-08-14 | False | 2023-12-16T22:50:02.137801 | 2023-12-16T23:29:35.540009 | LA NUEVA CANTINA | guilt_free |
| 2313 | 164069201972074912 | 121.09302 | 2019-07-10 | False | 2023-12-16T22:50:01.739708 | 2023-12-16T23:29:35.544304 | C Environme Dir | income |
| 36128 | 164069205392529856 | -29.322946 | 2019-07-13 | False | 2023-12-16T22:50:02.137801 | 2023-12-16T23:29:35.540009 | Costco | guilt_free |
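The sampling cell above splits `n` rows between debits and credits in proportion to the real data, rounding each count separately. In edge cases two independent roundings can leave the total one row off `n` (for example, `n = 101` with a ratio of 0.5 gives 50 + 50 under Python's round-half-to-even). A minimal sketch of an alternative derives the second count by subtraction; `split_counts` is a hypothetical helper, and the figures in the usage line are made up for illustration.

```python
def split_counts(n, debit_ratio):
    """Split n rows into (num_debits, num_credits) that always sum to n."""
    if not 0.0 <= debit_ratio <= 1.0:
        raise ValueError('debit_ratio must be between 0 and 1')
    num_debits = round(n * debit_ratio)
    # Derive the second count by subtraction rather than a second round().
    num_credits = n - num_debits
    return num_debits, num_credits


# Hypothetical figures: 8,200 debits among 10,000 real transactions.
print(split_counts(100000, 8200 / 10000))
```

For the ratios typical of this dataset the two approaches should agree; the subtraction only matters at the rounding boundary.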


In [None]:
from sdv.evaluation.single_table import run_diagnostic, evaluate_quality
from sdv.evaluation.single_table import get_column_plot

# 1. perform basic validity checks
diagnostic = run_diagnostic(df, synthetic_data, metadata)

# 2. measure the statistical similarity
quality_report = evaluate_quality(df, synthetic_data, metadata)

# 3. plot the data
fig = get_column_plot(
    real_data=df,
    synthetic_data=synthetic_data,
    metadata=metadata,
    column_name='cat_label'
)

fig.show()

Generating report ...

(1/2) Evaluating Data Validity: |██████████| 8/8 [00:00<00:00, 369.72it/s]|
Data Validity Score: 100.0%

(2/2) Evaluating Data Structure: |██████████| 1/1 [00:00<00:00, 270.39it/s]|
Data Structure Score: 60.0%

Overall Score (Average): 80.0%

Generating report ...

(1/2) Evaluating Column Shapes: |██████████| 8/8 [00:00<00:00, 32.59it/s]|
Column Shapes Score: 80.49%

(2/2) Evaluating Column Pair Trends: |██████████| 28/28 [00:00<00:00, 101.23it/s]|
Column Pair Trends Score: 42.9%

Overall Score (Average): 61.69%



In [None]:
# Write without the shuffled row index so it is not saved as an extra unnamed column
synthetic_data.to_csv('synth-transactions.csv', index=False)