# Generate Credit Card Transactions
**This notebook generates credit card transactions and randomly injects fraud chain attacks.**

**THIS NOTEBOOK CAN BE RUN IN PARALLEL WITH `1_setup.ipynb`**

**Recommended settings to run this notebook in SageMaker Studio:**

- Image: Data Science
- Kernel: Python3
- Instance type: <font color='blue'>ml.m5.large (2 vCPU + 8 GiB)</font>

---

## Contents

1. [Background](#Background)
1. [Setup](#Setup)
1. [Generate Transactions](#Generate-Transactions)
1. [Inject Fradulent Transactions](#Inject-Fradulent-Transactions)
1. [Save Generated Data](#Save-Generated-Data)

### Background
This notebook generates random credit card transactions for 10K users over a period of 5 months. In an ideal scenario, these historical transactions would be accumulated into a data lake/store for batch processing so as to derive insights and analytics about this data. Credit card numbers can be bought in bulk on the dark web through previous leaks or hacks of organizations that store this sensitive data. Fraudsters will buy these card lists and attempt to make as many transactions as possible with the stolen numbers until the card is blocked. These fraud chain attacks typically happen in a short time frame and can be easily spotted amongst historical transactions. This is because the velocity of transactions during the attack significantly differs from that of cardholder’s usual spending pattern. This notebook is optional to run. The generated data already exists in the `./data` folder for you to use. Re-run this notebook if you desire to re-populate fresh data or understand the whole process of how this dataset was generated.

### Setup

#### Prerequisites 

In [3]:
!pip install Faker confluent-kafka

Collecting Faker
  Downloading Faker-36.1.1-py3-none-any.whl.metadata (15 kB)
Collecting confluent-kafka
  Downloading confluent_kafka-2.8.2-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (22 kB)
Downloading Faker-36.1.1-py3-none-any.whl (1.9 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.9/1.9 MB[0m [31m105.6 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading confluent_kafka-2.8.2-cp311-cp311-manylinux_2_28_x86_64.whl (3.8 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.8/3.8 MB[0m [31m102.5 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: Faker, confluent-kafka
Successfully installed Faker-36.1.1 confluent-kafka-2.8.2


# clickstream dummy data

In [None]:
import json
import time
import random
from faker import Faker
from confluent_kafka import Producer

# Initialize Faker and Kafka Producer
fake = Faker()
producer_config = {
    'bootstrap.servers': 'localhost:9092'  # Replace with your Kafka broker address
}
producer = Producer(producer_config)

# Kafka topic to which events will be sent
topic = 'customer-events'

# Possible event types
event_types = ['page_view', 'click', 'add_to_cart', 'purchase']

def delivery_report(err, msg):
    """Called once for each message produced to indicate delivery result."""
    if err is not None:
        print(f'Message delivery failed: {err}')
    else:
        print(f'Message delivered to {msg.topic()} [{msg.partition()}]')

def generate_dummy_event():
    """Generates a dummy customer interaction event."""
    return {
        'event_id': fake.uuid4(),
        'timestamp': fake.iso8601(),
        'customer_id': random.randint(1, 1000),
        'session_id': fake.uuid4(),
        'event_type': random.choice(event_types),
        'product_id': random.randint(1, 500),
        'product_category': random.choice(['electronics', 'fashion', 'home', 'books', 'toys']),
        'price': round(random.uniform(10.0, 500.0), 2)
    }

# # Continuously produce events
# while True:
#     event = generate_dummy_event()
#     print(event)
#     # Convert event to JSON and produce to Kafka topic
#     producer.produce(topic, json.dumps(event).encode('utf-8'), callback=delivery_report)
#     # Poll to handle delivery reports (non-blocking)
#     producer.poll(0)
#     # Wait a bit before sending the next event (adjust frequency as needed)
#     time.sleep(1)

%6|1740980219.214|FAIL|rdkafka#producer-1| [thrd:localhost:9092/bootstrap]: localhost:9092/bootstrap: Disconnected while requesting ApiVersion: might be caused by incorrect security.protocol configuration (connecting to a SSL listener?) or broker version is < 0.10 (see api.version.request) (after 0ms in state APIVERSION_QUERY)
%3|1740980219.446|FAIL|rdkafka#producer-1| [thrd:localhost:9092/bootstrap]: localhost:9092/bootstrap: Connect to ipv6#[::1]:9092 failed: Connection refused (after 0ms in state CONNECT)
%3|1740980220.217|FAIL|rdkafka#producer-1| [thrd:localhost:9092/bootstrap]: localhost:9092/bootstrap: Connect to ipv6#[::1]:9092 failed: Connection refused (after 2ms in state CONNECT, 1 identical error(s) suppressed)
%6|1740980221.222|FAIL|rdkafka#producer-1| [thrd:localhost:9092/bootstrap]: localhost:9092/bootstrap: Disconnected while requesting ApiVersion: might be caused by incorrect security.protocol configuration (connecting to a SSL listener?) or broker version is < 0.10 (se

# CC rest

#### Imports 

In [4]:
from botocore.client import ClientError
from collections import defaultdict
from faker import Faker
import pandas as pd
import numpy as np
import sagemaker
import datetime
import hashlib
import random
import boto3
import math
import os

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


#### Seed for Reproducibility

In [5]:
faker = Faker()
faker.seed_locale('en_US', 0)

In [6]:
SEED = 123
random.seed(SEED)
np.random.seed(SEED)
faker.seed_instance(SEED)

#### Constants 

In [7]:
TOTAL_UNIQUE_TRANSACTIONS = 5400000 # 5.4 Million
TOTAL_UNIQUE_USERS = 10000
BUCKET = sagemaker.Session().default_bucket()

### Generate Transactions

#### Generate Unique Credit Card Numbers 
<p> Credit card numbers are uniquely assigned to users. Since, there are 10K users, we would want to generate 10K unique card numbers.</p>

In [8]:
def generate_unique_credit_card_numbers(n: int) -> list:
    cc_ids = set()
    for _ in range(n):
        cc_id = faker.credit_card_number(card_type='visa')
        cc_ids.add(cc_id)
    return list(cc_ids) 

In [9]:
credit_card_numbers = generate_unique_credit_card_numbers(TOTAL_UNIQUE_USERS)

In [10]:
assert len(credit_card_numbers) == 10000 
assert len(credit_card_numbers[0]) == 16 # validate if generated number is 16-digit

In [11]:
# inspect random sample of credit card numbers 
random.sample(credit_card_numbers, 5)

['4272687935162831',
 '4219523862300043',
 '4894339437990149',
 '4156358307182747',
 '4715825935231868']

#### Generate Time Series
<p>Generate 5.4 Million random timestamps spread across a period of 5 months (2022-01-01 to 2022-06-01) in sorted order.</p>
<b>Note:</b> The timestamps are NOT unique themselves. We can have 2 or more transactions occurring at the same time coming from different users. 

In [12]:
def generate_timestamps(n: int) -> list:
    start = datetime.datetime.strptime('2022-01-01 00:00:00', '%Y-%m-%d %H:%M:%S')
    end = datetime.datetime.strptime('2022-06-01 00:01:00', '%Y-%m-%d %H:%M:%S')
    timestamps = list()
    for _ in range(n):
        timestamp = faker.date_time_between(start_date=start, end_date=end, tzinfo=None).strftime('%Y-%m-%d %H:%M:%S')
        timestamps.append(timestamp)
    timestamps = sorted(timestamps)
    return timestamps

In [13]:
timestamps = generate_timestamps(TOTAL_UNIQUE_TRANSACTIONS)

In [14]:
assert len(timestamps) == TOTAL_UNIQUE_TRANSACTIONS

In [15]:
# inspect random sample of timestamps
random.sample(timestamps, 5)

['2022-01-26 06:47:20',
 '2022-01-09 22:43:37',
 '2022-03-30 21:46:15',
 '2022-05-06 18:34:29',
 '2022-05-12 21:21:13']

#### Generate Random Transaction Amounts 
<p>The transaction amounts are presumed to follow Pareto distribution, as it is logical for consumers to make many more smaller purchases than large ones. The break down of the distribution is shown in the table below.</p>


| Percentage        | Range (Amount in $)     |
| :-------------: | :----------: |
|  5\% | 0.01 to 1    |
| 7.5\%   | 1 to 10 |
| 52.5\%   | 10 to 100 |
| 25\%   | 100 to 1000 |
| 10\%   | 1000 to 10000 |

In [16]:
def get_random_transaction_amount(start: float, end: float) -> float:
    amt = round(np.random.uniform(start, end), 2)
    return amt

In [17]:
distribution_percentages = {0.05: (0.01, 1.01), 
                            0.075: (1, 11.01),
                            0.525: (10, 100.01),
                            0.25: (100, 1000.01),
                            0.10: (1000, 10000.01)}

In [18]:
amounts = []

for percentage, span in distribution_percentages.items():
    n = int(TOTAL_UNIQUE_TRANSACTIONS * percentage)
    start, end = span
    for _ in range(n):
        amounts.append(get_random_transaction_amount(start, end+1))
        
random.shuffle(amounts)

In [19]:
assert len(amounts) == TOTAL_UNIQUE_TRANSACTIONS

In [20]:
# inspect random sample of transaction amounts
random.sample(amounts, 5)

[2700.62, 0.3, 79.26, 466.93, 63.75]

#### Generate Credit Card Transactions
<br>
<div style="text-align: justify">
Using the random credit card numbers, timestamps and transaction amounts generated in the above steps, 
we can generate random credit card transactions by combining them. The transaction id for the transaction is the md5
hash of the above mentioned entities.
</div>

In [21]:
def generate_transaction_id(timestamp: str, credit_card_number: str, transaction_amount: float) -> str:
    hashable = f'{timestamp}{credit_card_number}{transaction_amount}'
    hexdigest = hashlib.md5(hashable.encode('utf-8')).hexdigest()
    return hexdigest

In [22]:
transactions = []
for timestamp, amount in zip(timestamps, amounts):
    credit_card_number = random.choice(credit_card_numbers)
    transaction_id = generate_transaction_id(timestamp, credit_card_number, amount)
    transactions.append({'tid': transaction_id, 
                         'datetime': timestamp, 
                         'cc_num': credit_card_number, 
                         'amount': amount, 
                         'fraud_label': 0})

In [23]:
assert len(transactions) == TOTAL_UNIQUE_TRANSACTIONS

In [24]:
# inspect random sample of credit card transactions
random.sample(transactions, 1)

[{'tid': '7e5b8245bc1b47334b60d20b4d99e557',
  'datetime': '2022-02-26 11:19:04',
  'cc_num': '4991882247398880',
  'amount': 81.6,
  'fraud_label': 0}]

### Inject Fradulent Transactions
<p> A typical fraud chain looks like the one as shown in the image below.</p>

![SegmentLocal](images/fraud_pattern.png "connection")

In [25]:
FRAUD_RATIO = 0.0025 # percentage of transactions that are fraudulent
NUMBER_OF_FRAUDULENT_TRANSACTIONS = int(FRAUD_RATIO * TOTAL_UNIQUE_TRANSACTIONS)
ATTACK_CHAIN_LENGTHS = [3, 4, 5, 6, 7, 8, 9, 10]

#### Create Transaction Chains 

In [26]:
visited = set()
chains = defaultdict(list)

In [27]:
def size(chains: dict) -> int:
    counts = {key: len(values)+1 for (key, values) in chains.items()}
    return sum(counts.values())

In [28]:
def create_attack_chain(i: int):
    chain_length = random.choice(ATTACK_CHAIN_LENGTHS)
    for j in range(1, chain_length):
        if i+j not in visited:
            if size(chains) == NUMBER_OF_FRAUDULENT_TRANSACTIONS:
                break
            chains[i].append(i+j)
            visited.add(i+j)

In [29]:
while size(chains) < NUMBER_OF_FRAUDULENT_TRANSACTIONS:
    i = random.choice(range(TOTAL_UNIQUE_TRANSACTIONS))
    if i not in visited:
        create_attack_chain(i)
        visited.add(i)

In [30]:
assert size(chains) == NUMBER_OF_FRAUDULENT_TRANSACTIONS

#### Modify Transactions with Fraud Chain Attacks 

In [31]:
def generate_timestamps_for_fraud_attacks(timestamp: str, chain_length: int) -> list:
    timestamps = []
    timestamp = datetime.datetime.strptime(timestamp, '%Y-%m-%d %H:%M:%S')
    for _ in range(chain_length):
        # interval in seconds between fraudulent attacks
        delta = random.randint(30, 120)
        current = timestamp + datetime.timedelta(seconds=delta)
        timestamps.append(current.strftime('%Y-%m-%d %H:%M:%S'))
        timestamp = current
    return timestamps 

In [32]:
def generate_amounts_for_fraud_attacks(chain_length: int) -> list:
    amounts = []
    for percentage, span in distribution_percentages.items():
        n = math.ceil(chain_length * percentage)
        start, end = span
        for _ in range(n):
            amounts.append(get_random_transaction_amount(start, end+1))
    return amounts[:chain_length]

In [33]:
for key, chain in chains.items():
    transaction = transactions[key]
    timestamp = transaction['datetime']
    cc_num = transaction['cc_num']
    amount = transaction['amount']
    transaction['fraud_label'] = 1
    inject_timestamps = generate_timestamps_for_fraud_attacks(timestamp, len(chain))
    inject_amounts = generate_amounts_for_fraud_attacks(len(chain))
    random.shuffle(inject_amounts)
    for i, idx in enumerate(chain):
            original_transaction = transactions[idx]
            inject_timestamp = inject_timestamps[i]
            original_transaction['datetime'] = inject_timestamp
            original_transaction['fraud_label'] = 1
            original_transaction['cc_num'] = cc_num
            original_transaction['amount'] = inject_amounts[i]
            original_transaction['tid'] = generate_transaction_id(inject_timestamp, cc_num, amount)
            transactions[idx] = original_transaction

#### Transform Transactions to Pandas DataFrame

In [34]:
transactions_df = pd.DataFrame(transactions)

In [35]:
fraud_transactions = transactions_df[transactions_df.fraud_label.eq(1)]
fraud_transactions.head()

Unnamed: 0,tid,datetime,cc_num,amount,fraud_label
9273,facdcb53c2f121adb61bb5303a2668b4,2022-01-01 06:07:42,4084695864193139,6.33,1
9274,d8275de4398389c654c3419db869b39d,2022-01-01 06:08:18,4084695864193139,4.58,1
9275,77fe2f48cb658fd76a063042fc19b696,2022-01-01 06:09:44,4084695864193139,53.21,1
9276,cb66afc413fac12580b2fd692b593cba,2022-01-01 06:11:30,4084695864193139,0.94,1
9277,810f557eff7c03a814eeeebd0b29fe1d,2022-01-01 06:12:17,4084695864193139,51.41,1


In [36]:
assert fraud_transactions.count()[0] == NUMBER_OF_FRAUDULENT_TRANSACTIONS

  assert fraud_transactions.count()[0] == NUMBER_OF_FRAUDULENT_TRANSACTIONS


### Save Generated Data
<p> The generated raw transactions data will be used by the next step = SageMaker PySpark Processing Job to do aggregations on the raw data columns and derive new features which are useful for model training in the later steps.
The generated data is saved locally and then copied to S3 bucket.</p>

#### Save Transactions Data to Local Folder ./data and upload to S3

In [37]:
data_dir = os.path.join(os.getcwd(), 'data/raw')
os.makedirs(data_dir, exist_ok=True)

In [38]:
transactions_df.to_csv(f'{data_dir}/transactions.csv', index=False)
transactions_df.to_csv(f's3://{BUCKET}/raw/transactions.csv', index=False)