# U.S. Credit Card Transaction Synthetic Dataset
Open-source financial datasets are often heavily encoded, which makes them hard to interpret and analyze. A significant limitation I faced when using a synthetic dataset from Kaggle was the lack of variety in merchants, which led my model to misclassify specific transactions. For example, Sam's Club was categorized as a restaurant instead of retail, while Costco (a direct competitor) was correctly classified as retail. This issue arose because Sam's Club was not included in the dataset.

To address these challenges, I created a synthetic dataset to improve the accuracy of my original model. It includes a wide variety of merchants so that the model generalizes better to unseen data. It is also easier to interpret, with clear, meaningful columns that align with real-world U.S. credit card transactions.

## Notebook Outline
1. Importing Libraries
2. Dataset Schemas and Helper Functions
3. Generating the Synthetic Dataset
4. Saving the Dataset to CSV

## 1. Importing Libraries

We import the necessary libraries for generating and exploring the synthetic data.

In [1]:
# Imports and Setup

import pandas as pd
import random
import uuid
from faker import Faker
from datetime import datetime, timedelta

# Initialize Faker for realistic data generation
fake = Faker()
Faker.seed(42)
random.seed(42)

## 2. Dataset Schemas and Helper Functions
| **Column Name**        | **Description**                                                                                      |
|-------------------------|----------------------------------------------------------------------------------------------------|
| `transaction_id`        | Unique identifier for each transaction.                                                            |
| `customer_id`           | Unique identifier for the customer.                                                               |
| `card_number`           | Last 4 digits of the customer's credit card.                                                      |
| `timestamp`             | Date and time of the transaction.                                                                 |
| `merchant_category`     | Category of the merchant (e.g., Groceries, Dining, Travel).                                       |
| `merchant_name`         | Name of the merchant (e.g., Walmart, Amazon, Starbucks).                                          |
| `amount`                | Transaction amount in USD.                                                                        |
| `card_provider`         | Card network (e.g., VISA, MasterCard, Discover, American Express).                                |
| `card_present`          | Whether the card was physically present (Boolean: `True` or `False`).                             |
| `device`                | Device used for the transaction (e.g., Mobile, Desktop, POS).                                     |
| `channel`               | Transaction channel (`Online` or `Physical`).                                                     |

We will also use helper functions to generate random transaction IDs and customer IDs and to simulate realistic spending patterns.

In [31]:
# Testing helper functions for generating random data
def generate_transaction_id():
    return str(uuid.uuid4())

def generate_customer_id():
    return f"CUST-{random.randint(1000, 9999)}"

def generate_card_number():
    return f"{random.randint(1000, 9999)}"

# Example of function usage
print(generate_transaction_id(), generate_customer_id(), generate_card_number())

0b00e337-bbb8-40e9-bdf5-50bdbb88d69f CUST-5453 4312
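
One caveat about reproducibility: `uuid.uuid4()` draws from the operating system's entropy source, so `random.seed(42)` does not make the transaction IDs repeatable between runs. If repeatable IDs matter, one possible approach is to build the UUID from the seeded `random` stream instead. The helper name `generate_seeded_transaction_id` below is hypothetical and not part of the original notebook; treat this as a minimal sketch.

```python
import random
import uuid

def generate_seeded_transaction_id():
    """Hypothetical helper: build a version-4 UUID from the seeded `random` stream."""
    # Passing version=4 makes uuid.UUID set the version and variant bits,
    # so the result is still a well-formed UUIDv4 string.
    return str(uuid.UUID(int=random.getrandbits(128), version=4))

# Re-seeding before each call yields the same ID both times.
random.seed(42)
first = generate_seeded_transaction_id()
random.seed(42)
second = generate_seeded_transaction_id()
print(first == second)
```

The trade-off is that these IDs are only as unique as the seeded stream allows, which is acceptable for synthetic data but not for production identifiers.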


## 3. Generating the Synthetic Dataset

This code creates a synthetic dataset containing 75,000 transactions. Each transaction includes:
- Merchant name and category.
- Transaction amount.
- Card provider, channel (online or physical), and device.

In [32]:
# Dictionary of real merchants for generating realistic data
# TODO: Add more merchants to each category
# TODO: Transition to using a database for merchant data
categories = {
    'Groceries': {
        'limits': (10, 200),
        'merchants': ['Walmart', 'Trader Joe\'s', 'Kroger', 'Safeway', 'Whole Foods', 'Publix', 'H-E-B', 'Aldi', 'Wegmans', 'Meijer', 'Sprouts', 'WinCo Foods', 'Flea Market', 'Farmers Market', 'Groceries', 'Grocery Outlet', 'Grocery Store']},
    'Fast Food': {
        'limits': (5, 50),
        'merchants': ['McDonald\'s', 'Starbucks', 'Subway', 'Chipotle', 'Taco Bell', 'Chick-fil-A', 'In-N-Out', 'Raising Cane\'s', 'Burger King', 'Panda Express', 'Culver\'s', 'Dunkin\'', 'Sonic', 'Wingstop', 'Jersey Mike\'s', 'Shake Shack', 'Papa John\'s', 'Little Caesars', 'Domino\'s', 'Pizza Hut', 'Pizza', 'Chicken Nuggets', 'Burgers', 'Fries', 'Fast Food']},
    'Restaurants': {
        'limits': (20, 250),
        'merchants': ['Olive Garden', 'Chili\'s', 'Red Lobster', 'Applebee\'s', 'Cheesecake Factory', 'Texas Roadhouse', 'Outback Steakhouse', 'Buffalo Wild Wings', 'Denny\'s', 'IHOP', 'Cracker Barrel', 'Golden Corral', 'Red Robin', 'The Melting Pot', 'Dave & Buster\'s', 'Ruth\'s Chris Steakhouse', 'Nobu', 'Capital Grille', 'Benihana\'s', 'Buffet', 'Steak', 'Wine', 'Restaurant', 'Dining', 'Dinner', 'Lunch', 'Breakfast']},
    'Airlines': {
        'limits': (150, 2000),
        'merchants': ['United Airlines', 'United', 'Delta Airlines', 'American Airlines', 'Southwest Airlines', 'Alaska Airlines', 'JetBlue', 'Hawaiian Airlines', 'Spirit Airlines', 'Frontier Airlines', 'Allegiant Air', 'Breeze Airways', 'Cape Air', 'Air Canada', 'WestJet', 'Volaris', 'Aeromexico', 'LATAM', 'Avianca', 'Copa Airlines', 'British Airways', 'Lufthansa', 'Iberia', 'Turkish Airlines', 'Air France', 'KLM', 'Qantas', 'Emirates', 'Etihad', 'Qatar Airways', 'Cathay Pacific', 'Singapore Airlines', 'Korean Air', 'ANA', 'Japan Airlines', 'Airlines', 'Airline', 'Flight', 'Airplane', 'Airport', 'Travel']},
    'Hotels': {
        'limits': (100, 2000),
        'merchants': ['Marriott', 'Hilton', 'Hyatt', 'Best Western', 'Motel 6', 'Holiday Inn', 'Fairmont', 'Four Seasons', 'La Quinta', 'Motel', 'Hotel', 'Inn']},
    'Gas': {
        'limits': (20, 200),
        'merchants': ['Chevron', 'Shell', 'Exxon', '76', 'ARCO', 'BP', 'Speedway', 'Murphy USA', 'Local Gas Station', 'Gas Station', 'Fuel', 'Gas']},
    'Retail': {
        'limits': (20, 1500),
        'merchants': ['Amazon', 'Target', 'Best Buy', 'Apple', 'DoorDash', 'Costco', 'Sam\'s Club', 'Home Depot', 'Lowe\'s', 'Nordstrom', 'Macy\'s', 'Kohl\'s', 'JCPenney', 'Sears', 'Kmart', 'Dollar Tree', 'Dollar General', 'Family Dollar', '99 Cents Only', 'Dollar Store', 'Clothes', 'Men\'s Wearhouse', 'Women\'s Wearhouse', 'REI', 'Burlington', 'Gucci', 'Louis Vuitton', 'Chanel', 'Prada', 'Saks Fifth Avenue', 'TJ Maxx', 'Ross', 'H&M', 'Uniqlo', 'eBay', 'Wayfair', 'Etsy', 'SHEIN', 'Temu', 'AliExpress', 'ASOS', 'HelloFresh', 'BarkBox', 'Stitch Fix', 'InstaCart', 'Shoes', 'Electronics', 'Appliances', 'Furniture', 'Home Goods', 'Retail', 'Store', 'Shop', 'Mall', 'Outlet']},
    'Wireless': {
        'limits': (20, 300),
        'merchants': ['Verizon', 'AT&T', 'T-Mobile', 'Sprint', 'Boost Mobile', 'Cricket Wireless', 'MetroPCS', 'Wireless', 'Mint Mobile', 'Visible', 'Google Fi', 'Wireless Store', 'Wireless Shop']},
    'Utilities': {
        'limits': (50, 500),
        'merchants': ['Comcast', 'PG&E', 'SCE', 'Water Co', 'Gas', 'Electric', 'Utilities', 'Charter', 'Spectrum', 'Internet', 'Cable', 'Phone', 'Cell Phone', 'Wireless', 'Landline', 'TV', 'Streaming', 'Streaming Service', 'Streaming Platform']},
    'Health': {
        'limits': (50, 500),
        'merchants': ['Kaiser', 'Sutter Health', 'CVS', 'Walgreens', 'Rite Aid', 'Urgent Care', 'Cigna', 'Blue Cross', 'United Healthcare', 'Doctor', 'Hospital', 'Pharmacy']},
    'Entertainment': {
        'limits': (10, 250),
        'merchants': ['Netflix', 'Hulu', 'Disney+', 'HBO', 'Spotify', 'Apple Music', 'Audible', 'Crunchyroll', 'Paramount+', 'YouTube', 'Twitch', 'Steam', 'Max', 'Prime Video', 'AMC', 'Cinemark', 'Regal', 'Theater', 'Concert', 'Event', 'Ticket', 'StubHub', 'Live Nation', 'Nintendo', 'PlayStation', 'Xbox', 'GameStop', 'Game Store', 'Game Shop', 'Game', 'Entertainment', 'Movies', 'Music', 'Gaming']},
    'Transportation': {
        'limits': (5, 250),
        'merchants': ['Lyft', 'Uber', 'Taxi', 'Bus', 'Train', 'Transportation', 'Ride Share', 'BART', 'Caltrain', 'VTA', 'Enterprise', 'Hertz', 'Avis', 'Turo', 'Zipcar', 'Car Rental', 'Rental Car', 'Getaround', 'Scooter', 'Bike', 'Lime', 'Bird', 'Spin', 'Scoot', 'Via', 'Wingz', 'Curb', 'BlaBlaCar']},
    'Education': {
        'limits': (50, 5000),
        'merchants': ['UC Berkeley', 'Stanford', 'SJSU', 'Ohio State', 'UT Austin', 'University of Florida', 'Penn State', 'UPenn', 'Dartmouth', 'Cornell', 'OSU', 'Michigan', 'UCLA', 'Yale', 'Princeton', 'WashU', 'Saint Louis University', 'SLU', 'NYU', 'Harvard', 'Foothill College', 'Education', 'Books', 'School', 'Supplies', 'Tuition', 'Chegg', 'Textbooks', 'Bookstore', 'Coursera', 'Udemy', 'EdX', 'LinkedIn Learning', 'Skillshare', 'Masterclass', 'Khan Academy', 'Education Platform', 'Education Service', 'Private School', 'Public School', 'College', 'University', 'Community College', 'Trade School', 'Vocational School', 'Online School', 'Online Course', 'Online Education', 'Online Learning', 'Online Platform', 'Online Service']},
}
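
In the dictionary above, a few generic names are listed under more than one category: `'Gas'` appears under both Gas and Utilities, and `'Wireless'` appears under both Wireless and Utilities. For a classifier, that means the same merchant string can carry two different labels. The sketch below shows one way to detect such overlaps; `find_shared_merchants` is a hypothetical helper, and `sample` is a trimmed-down excerpt of the dictionary, used only to keep the example short.

```python
from collections import defaultdict

def find_shared_merchants(categories):
    """Return {merchant: [categories]} for merchants listed under several categories."""
    seen = defaultdict(list)
    for category, info in categories.items():
        # set() ignores a name repeated inside the same category.
        for merchant in set(info['merchants']):
            seen[merchant].append(category)
    return {m: cats for m, cats in seen.items() if len(cats) > 1}

# Trimmed-down excerpt of the merchant dictionary, for illustration only.
sample = {
    'Gas': {'limits': (20, 200), 'merchants': ['Chevron', 'Shell', 'Gas']},
    'Wireless': {'limits': (20, 300), 'merchants': ['Verizon', 'Wireless']},
    'Utilities': {'limits': (50, 500), 'merchants': ['Comcast', 'Gas', 'Wireless']},
}

shared = find_shared_merchants(sample)
for merchant, cats in sorted(shared.items()):
    print(merchant, cats)
```

Running the same function on the full `categories` dictionary would list every overlapping name, which can then be renamed or removed from one of the categories.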

Next, we define the helper functions that generate each field of a transaction.

In [33]:
# Initialize Faker for realistic data generation
fake = Faker()

# Helper functions
def generate_transaction_id():
    return str(uuid.uuid4())

def generate_customer_id():
    return f"CUST-{random.randint(1000, 9999)}"

def generate_card_number():
    return f"{random.randint(1000, 9999)}"

def generate_timestamp():
    return fake.date_time_between(start_date='-1y', end_date='now').strftime('%Y-%m-%d %H:%M:%S')

def generate_card_provider():
    return random.choice(['VISA', 'MasterCard', 'American Express', 'Discover'])

def generate_channel():
    return random.choice(['Online', 'Physical'])

def generate_device(channel):
    if channel == 'Online':
        return random.choice(['Mobile', 'Desktop', 'Tablet'])
    else:
        return random.choice(['Mobile', 'Tablet', 'Desktop', 'POS'])

def generate_category():
    return random.choice(list(categories.keys()))

def generate_amount(category):
    low, high = categories[category]['limits']
    return round(random.uniform(low, high), 2)

def generate_merchant(category):
    return random.choice(categories[category]['merchants'])
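
`generate_category` picks every category with equal probability, so an airline ticket is as likely as a grocery run. If more realistic spending patterns are wanted, one option is weighted sampling with `random.choices`. The weights in `CATEGORY_WEIGHTS` below are hypothetical, illustrative guesses rather than measured data, and `generate_weighted_category` is not part of the original notebook.

```python
import random
from collections import Counter

# Hypothetical relative weights: illustrative guesses, not measured spending data.
CATEGORY_WEIGHTS = {
    'Groceries': 25,
    'Fast Food': 20,
    'Retail': 20,
    'Gas': 15,
    'Restaurants': 10,
    'Utilities': 5,
    'Hotels': 3,
    'Airlines': 2,
}

def generate_weighted_category(weights=CATEGORY_WEIGHTS):
    """Pick one category with probability proportional to its weight."""
    names = list(weights)
    return random.choices(names, weights=[weights[n] for n in names], k=1)[0]

# Draw a sample and count how often each category comes up.
random.seed(42)
counts = Counter(generate_weighted_category() for _ in range(10_000))
print(counts.most_common(3))
```

The weights do not need to sum to 100; `random.choices` normalizes them internally.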


Next, we combine the helpers to generate a single transaction.

In [34]:
def generate_transaction_row():
    # Select a random category.
    category = generate_category()

    # Select a channel and device.
    channel = generate_channel()
    device = generate_device(channel)

    # Generate a single transaction record.
    return {
        'transaction_id': generate_transaction_id(),
        'customer_id': generate_customer_id(),
        'card_number': generate_card_number(),
        'timestamp': generate_timestamp(),
        'merchant_category': category,
        'merchant_name': generate_merchant(category),
        'amount': generate_amount(category),
        'card_provider': generate_card_provider(),
        'channel': channel,
        'device': device
    }
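
The schema table lists a `card_present` column, but `generate_transaction_row` does not emit one. A simple way to fill it, assuming a card is present only for `Physical` transactions, is to derive it from `channel`. That rule is an assumption made for this sketch (it ignores cases such as mobile wallets), and `derive_card_present` is a hypothetical helper.

```python
def derive_card_present(channel):
    """Assumption: the card is physically present only on the 'Physical' channel."""
    return channel == 'Physical'

# Example rows using the channel values produced by generate_channel().
in_store = {'channel': 'Physical', 'device': 'POS'}
online = {'channel': 'Online', 'device': 'Mobile'}

for row in (in_store, online):
    row['card_present'] = derive_card_present(row['channel'])
    print(row)
```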

We then wrap the row generator in a function that builds the full dataset.

In [35]:
def generate_synthetic_dataset(num_transactions=10000):
    # Generate a list of transactions.
    transactions = [generate_transaction_row() for _ in range(num_transactions)]

    # Create a DataFrame from the list of transactions.
    return pd.DataFrame(transactions)
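
Because `customer_id`, `card_number`, and `card_provider` are drawn independently for every row, the same customer can show up with a different card and provider on each transaction. With only 9,000 possible IDs and `generate_synthetic_dataset(75000)`, each ID appears roughly eight times on average. If per-customer consistency matters, one option is to build a fixed pool of customers first and sample rows from it. The function `build_customer_pool` and the pool size of 500 are hypothetical choices for this sketch.

```python
import random

PROVIDERS = ['VISA', 'MasterCard', 'American Express', 'Discover']

def build_customer_pool(num_customers):
    """Create customers that each keep one card number and one provider."""
    pool = []
    for i in range(num_customers):
        pool.append({
            'customer_id': f"CUST-{1000 + i}",
            # Assumption: zero-padding so last-4 values such as '0042' are possible.
            'card_number': f"{random.randint(0, 9999):04d}",
            'card_provider': random.choice(PROVIDERS),
        })
    return pool

random.seed(42)
pool = build_customer_pool(500)

# Each transaction would copy its customer fields from one pool entry.
customer = random.choice(pool)
print(customer)
```

A row generator could then merge `customer` into the transaction dictionary instead of calling the three independent helpers.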

Generate the dataset with 75,000 transactions.

In [36]:
df = generate_synthetic_dataset(75000)

Preview the first few rows.

In [37]:
print(df.head())

                         transaction_id customer_id card_number  \
0  feec7465-afb5-4de2-9c73-3a753004b084   CUST-2219        5173   
1  2273843f-df20-4532-a295-f4daed302890   CUST-1938        2099   
2  8a732724-0eb6-4a77-a3c7-1fbac59542a3   CUST-3634        2146   
3  83562abd-848f-4e03-99a6-f19a12a4c6d9   CUST-9425        7495   
4  4a52fa36-00a4-438d-af9d-446e6d1f5918   CUST-7861        1539   

             timestamp merchant_category merchant_name  amount  \
0  2024-04-10 22:19:27          Wireless        Sprint  138.02   
1  2025-01-13 16:15:58         Groceries        Kroger  168.69   
2  2024-03-07 07:08:36          Wireless  Boost Mobile   30.60   
3  2024-07-25 12:07:13         Groceries  Trader Joe's  155.86   
4  2024-12-21 15:58:34         Education     Princeton  609.30   

      card_provider   channel   device  
0              VISA  Physical      POS  
1  American Express  Physical  Desktop  
2          Discover  Physical      POS  
3              VISA  Physical   Mobi

## 4. Saving the Dataset to CSV

In [38]:
df.to_csv('../data/synthetic_transactions.csv', index=False)
print("Dataset saved to CSV.")

Dataset saved to CSV.
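
When the CSV is loaded again, pandas infers column types on its own: `card_number` is read as an integer and `timestamp` as plain strings unless told otherwise. The sketch below shows one way to keep both columns in the intended form by passing `dtype` and `parse_dates` to `pd.read_csv`. It uses an in-memory buffer and a two-row sample so that it runs on its own; the `'0042'` card number is a hypothetical value chosen to show why the string type matters.

```python
import io

import pandas as pd

# Two-row sample standing in for the generated dataset.
df = pd.DataFrame({
    'transaction_id': ['t-1', 't-2'],
    'card_number': ['0042', '5173'],
    'timestamp': ['2024-04-10 22:19:27', '2025-01-13 16:15:58'],
    'amount': [138.02, 168.69],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# Keep card numbers as text and parse timestamps into datetimes.
loaded = pd.read_csv(buffer, dtype={'card_number': str}, parse_dates=['timestamp'])
print(loaded.dtypes)
```

The same two arguments apply when reading `synthetic_transactions.csv` from disk.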
