# Generate Credit Card Transactions
**This notebook generates credit card transactions and randomly injects fraud chain attacks.**

**THIS NOTEBOOK CAN BE RUN IN PARALLEL WITH `1_setup.ipynb`**


### Background
This notebook generates random credit card transactions for 10K users over a period of 5 months. In an ideal scenario, these historical transactions would be accumulated into a data lake/store for batch processing so as to derive insights and analytics about this data. Credit card numbers can be bought in bulk on the dark web through previous leaks or hacks of organizations that store this sensitive data. Fraudsters will buy these card lists and attempt to make as many transactions as possible with the stolen numbers until the card is blocked. These fraud chain attacks typically happen in a short time frame and can be easily spotted amongst historical transactions. This is because the velocity of transactions during the attack significantly differs from that of cardholder’s usual spending pattern. This notebook is optional to run. The generated data already exists in the `./data` folder for you to use. Re-run this notebook if you desire to re-populate fresh data or understand the whole process of how this dataset was generated.

### Setup

#### Prerequisites 

In [None]:
#!pip install Faker

#### Imports 

In [1]:
from collections import defaultdict
from faker import Faker
import pandas as pd
import numpy as np
import datetime
import hashlib
import random
import math
import os

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log
13,application_1616513762404_0002,pyspark,idle,Link,Link


SparkSession available as 'spark'.


#### Seed for Reproducibility

In [2]:
faker = Faker()
faker.seed_locale('en_US', 0)

In [3]:
SEED = 123
random.seed(SEED)
np.random.seed(SEED)
faker.seed_instance(SEED)

#### Constants 

In [4]:
TOTAL_UNIQUE_TRANSACTIONS = 54000 #5400000 # 5.4 Million
TOTAL_UNIQUE_USERS = 100 #10000
START_DATE = '2020-01-01 00:00:00'
END_DATE = '2020-06-01 00:01:00'
DATE_FORMAT = '%Y-%m-%d %H:%M:%S'

### Generate Transactions

#### Generate Unique Credit Card Numbers 
<p> Credit card numbers are uniquely assigned to users. Since, there are 10K users, we would want to generate 10K unique card numbers.</p>

In [5]:
def generate_unique_credit_card_numbers(n: int) -> list:
    cc_ids = set()
    for _ in range(n):
        cc_id = faker.credit_card_number(card_type='visa')
        cc_ids.add(cc_id)
    return list(cc_ids) 

In [6]:
credit_card_numbers = generate_unique_credit_card_numbers(TOTAL_UNIQUE_USERS)

In [7]:
assert len(credit_card_numbers) == TOTAL_UNIQUE_USERS 
assert len(credit_card_numbers[0]) == 16 # validate if generated number is 16-digit

In [8]:
# inspect random sample of credit card numbers 
random.sample(credit_card_numbers, 5)

['4333652104565302', '4858129660270572', '4106727807825537', '4997591057565538', '4762714231199452']

#### Generate Time Series
<p>Generate 5.4 Million random timestamps spread across a period of 5 months (2020-01-01 to 2020-06-01) in sorted order.</p>
<b>Note:</b> The timestamps are NOT unique themselves. We can have 2 or more transactions occurring at the same time coming from different users. 

In [9]:
def generate_timestamps(n: int) -> list:
    start = datetime.datetime.strptime(START_DATE, DATE_FORMAT)
    end = datetime.datetime.strptime(END_DATE, DATE_FORMAT)
    timestamps = list()
    for _ in range(n):
        timestamp = faker.date_time_between(start_date=start, end_date=end, tzinfo=None).strftime(DATE_FORMAT)
        timestamps.append(timestamp)
    timestamps = sorted(timestamps)
    return timestamps

In [10]:
timestamps = generate_timestamps(TOTAL_UNIQUE_TRANSACTIONS)

In [11]:
assert len(timestamps) == TOTAL_UNIQUE_TRANSACTIONS

In [12]:
# inspect random sample of timestamps
random.sample(timestamps, 5)

['2020-02-19 01:47:28', '2020-01-20 14:13:14', '2020-01-07 19:33:43', '2020-03-10 17:48:42', '2020-04-09 01:14:42']

#### Generate Random Transaction Amounts 
<p>The transaction amounts are presumed to follow Pareto distribution, as it is logical for consumers to make many more smaller purchases than large ones. The break down of the distribution is shown in the table below.</p>


| Percentage        | Range (Amount in $)     |
| :-------------: | :----------: |
|  5\% | 0.01 to 1    |
| 7.5\%   | 1 to 10 |
| 52.5\%   | 10 to 100 |
| 25\%   | 100 to 1000 |
| 10\%   | 1000 to 10000 |

In [13]:
def get_random_transaction_amount(start: float, end: float) -> float:
    amt = round(np.random.uniform(start, end), 2)
    return amt

In [14]:
distribution_percentages = {0.05: (0.01, 1.01), 
                            0.075: (1, 11.01),
                            0.525: (10, 100.01),
                            0.25: (100, 1000.01),
                            0.10: (1000, 10000.01)}

In [15]:
amounts = []

for percentage, span in distribution_percentages.items():
    n = int(TOTAL_UNIQUE_TRANSACTIONS * percentage)
    start, end = span
    for _ in range(n):
        amounts.append(get_random_transaction_amount(start, end+1))
        
random.shuffle(amounts)

In [16]:
assert len(amounts) == TOTAL_UNIQUE_TRANSACTIONS

In [17]:
# inspect random sample of transaction amounts
random.sample(amounts, 5)

[88.21, 48.65, 66.05, 1.39, 7447.47]

#### Generate Credit Card Transactions
<br>
<div style="text-align: justify">
Using the random credit card numbers, timestamps and transaction amounts generated in the above steps, 
we can generate random credit card transactions by combining them. The transaction id for the transaction is the md5
hash of the above mentioned entities.
</div>

In [18]:
def generate_transaction_id(timestamp: str, credit_card_number: str, transaction_amount: float) -> str:
    hashable = f'{timestamp}{credit_card_number}{transaction_amount}'
    hexdigest = hashlib.md5(hashable.encode('utf-8')).hexdigest()
    return hexdigest

In [19]:
transactions = []
for timestamp, amount in zip(timestamps, amounts):
    credit_card_number = random.choice(credit_card_numbers)
    transaction_id = generate_transaction_id(timestamp, credit_card_number, amount)
    transactions.append({'tid': transaction_id, 
                         'datetime': timestamp, 
                         'cc_num': credit_card_number, 
                         'amount': amount, 
                         'fraud_label': 0})

In [20]:
assert len(transactions) == TOTAL_UNIQUE_TRANSACTIONS

In [21]:
# inspect random sample of credit card transactions
random.sample(transactions, 1)

[{'tid': 'dea73358f17e79a7ea8e59323b35692d', 'datetime': '2020-04-26 23:19:46', 'cc_num': '4609072304828342', 'amount': 90.69, 'fraud_label': 0}]

### Inject Fradulent Transactions
<p> A typical fraud chain looks like the one as shown in the image below.</p>
<img src="./images/fraud_pattern.png" />

In [22]:
FRAUD_RATIO = 0.0025 # percentage of transactions that are fraudulent
NUMBER_OF_FRAUDULENT_TRANSACTIONS = int(FRAUD_RATIO * TOTAL_UNIQUE_TRANSACTIONS)
ATTACK_CHAIN_LENGTHS = [3, 4, 5, 6, 7, 8, 9, 10]

#### Create Transaction Chains 

In [23]:
visited = set()
chains = defaultdict(list)

In [24]:
def size(chains: dict) -> int:
    counts = {key: len(values)+1 for (key, values) in chains.items()}
    return sum(counts.values())

In [25]:
def create_attack_chain(i: int):
    chain_length = random.choice(ATTACK_CHAIN_LENGTHS)
    for j in range(1, chain_length):
        if i+j not in visited:
            if size(chains) == NUMBER_OF_FRAUDULENT_TRANSACTIONS:
                break
            chains[i].append(i+j)
            visited.add(i+j)

In [26]:
while size(chains) < NUMBER_OF_FRAUDULENT_TRANSACTIONS:
    i = random.choice(range(TOTAL_UNIQUE_TRANSACTIONS))
    if i not in visited:
        create_attack_chain(i)
        visited.add(i)

In [27]:
assert size(chains) == NUMBER_OF_FRAUDULENT_TRANSACTIONS

#### Modify Transactions with Fraud Chain Attacks 

In [28]:
def generate_timestamps_for_fraud_attacks(timestamp: str, chain_length: int) -> list:
    timestamps = []
    timestamp = datetime.datetime.strptime(timestamp, DATE_FORMAT)
    for _ in range(chain_length):
        # interval in seconds between fraudulent attacks
        delta = random.randint(30, 120)
        current = timestamp + datetime.timedelta(seconds=delta)
        timestamps.append(current.strftime(DATE_FORMAT))
        timestamp = current
    return timestamps 

In [29]:
def generate_amounts_for_fraud_attacks(chain_length: int) -> list:
    amounts = []
    for percentage, span in distribution_percentages.items():
        n = math.ceil(chain_length * percentage)
        start, end = span
        for _ in range(n):
            amounts.append(get_random_transaction_amount(start, end+1))
    return amounts[:chain_length]

In [30]:
for key, chain in chains.items():
    transaction = transactions[key]
    timestamp = transaction['datetime']
    cc_num = transaction['cc_num']
    amount = transaction['amount']
    transaction['fraud_label'] = 1
    inject_timestamps = generate_timestamps_for_fraud_attacks(timestamp, len(chain))
    inject_amounts = generate_amounts_for_fraud_attacks(len(chain))
    random.shuffle(inject_amounts)
    for i, idx in enumerate(chain):
            original_transaction = transactions[idx]
            inject_timestamp = inject_timestamps[i]
            original_transaction['datetime'] = inject_timestamp
            original_transaction['fraud_label'] = 1
            original_transaction['cc_num'] = cc_num
            original_transaction['amount'] = inject_amounts[i]
            original_transaction['tid'] = generate_transaction_id(inject_timestamp, cc_num, amount)
            transactions[idx] = original_transaction

In [None]:
transactions

#### Transform Transactions to Pandas DataFrame

In [32]:
transactions_df = spark.createDataFrame(transactions)



In [33]:
transactions_df.show()

+-------+----------------+-------------------+-----------+--------------------+
| amount|          cc_num|           datetime|fraud_label|                 tid|
+-------+----------------+-------------------+-----------+--------------------+
|   74.6|4260567335033291|2020-01-01 00:01:52|          0|d9c721f01208c06d4...|
|2377.22|4925013053127624|2020-01-01 00:05:16|          0|22ec689978bac3643...|
|   40.6|4815447301191763|2020-01-01 00:06:17|          0|fab03ad2558fdb6ed...|
|  71.28|4985024436370614|2020-01-01 00:08:33|          0|430315d94fd88407a...|
|  63.23|4170840302810671|2020-01-01 00:21:12|          0|f8f3eac83e752149c...|
|  826.2|4106727807825537|2020-01-01 00:30:11|          0|7193119d65985c577...|
|  92.59|4307553668184625|2020-01-01 00:30:36|          0|219ce53c56ffbe351...|
| 506.02|4811343280984688|2020-01-01 00:39:22|          0|a430739641ce32a05...|
|  73.31|4159210768503456|2020-01-01 01:01:04|          0|91257427bbf3c4472...|
| 675.41|4802174255861762|2020-01-01 01:

In [35]:
fraud_transactions = transactions_df.where(transactions_df.fraud_label==1)
fraud_transactions.show()

+-------+----------------+-------------------+-----------+--------------------+
| amount|          cc_num|           datetime|fraud_label|                 tid|
+-------+----------------+-------------------+-----------+--------------------+
|  11.19|4208317936968510|2020-01-03 09:42:56|          1|5ea917b8ff8aae6b6...|
|   6.54|4208317936968510|2020-01-03 09:43:51|          1|85693516f091639b7...|
|   1.74|4208317936968510|2020-01-03 09:44:59|          1|9daf9c4ee744c7b4c...|
| 437.15|4526094550580419|2020-01-04 21:04:59|          1|d0488113c092404ed...|
|  11.15|4526094550580419|2020-01-04 21:06:40|          1|61470b5c5da54e01e...|
|   1.49|4526094550580419|2020-01-04 21:07:40|          1|111dee906496d5141...|
|8122.11|4880931623427294|2020-01-05 20:09:51|          1|36bb9de43d06c88cf...|
|  86.03|4880931623427294|2020-01-05 20:10:43|          1|e752ad9d0e95ac993...|
|   1.35|4880931623427294|2020-01-05 20:11:36|          1|6afd7c46bd26381a0...|
|  17.91|4880931623427294|2020-01-05 20:

In [37]:
fraud_transactions.count()

135

In [38]:
assert fraud_transactions.count() == NUMBER_OF_FRAUDULENT_TRANSACTIONS

### Save Generated Data
<p> The generated raw transactions data will be used by the next step = SageMaker PySpark Processing Job to do aggregations on the raw data columns and derive new features which are useful for model training in the later steps.
The generated data is saved locally and then copied to S3 bucket.</p>

#### Save Transactions Data to Resources Folder

In [39]:
transactions_df.write.mode("overwrite").option("header", "true").option("delimiter",",").csv("hdfs:///Projects/realtime/Resources/transactions.csv")