# Data generation through simulation of credit card transactions

## Data generation process

The simulation consists of five main steps, following this [book chapter](https://fraud-detection-handbook.github.io/fraud-detection-handbook/Chapter_3_GettingStarted/SimulatedDataset.html):

1. **Generation of customer profiles:** Every customer is different in their spending habits. This will be simulated by defining some properties for each customer. The main properties will be their geographical location, their spending frequency, and their spending amounts. The customer properties will be represented as a table, referred to as the customer profile table.

2. **Generation of terminal profiles:** Terminal properties will simply consist of a geographical location. The terminal properties will be represented as a table, referred to as the terminal profile table.

3. **Association of customer profiles to terminals:** We assume that customers only make transactions on terminals that lie within a radius r of their geographical location. This is a simple way to model that customers use terminals close to where they are. This step adds a feature `available_terminals` to each customer profile, containing the set of terminals that the customer can use.

4. **Generation of transactions:** The simulator will loop over the set of customer profiles, and generate transactions according to their properties (spending frequencies and amounts, and available terminals). This will result in a table of transactions.

5. **Generation of fraud scenarios:** This last step will label the transactions as legitimate or fraudulent. This will be done by following three different fraud scenarios.

## Setup

In [1]:
%load_ext autoreload
%autoreload 2

In [36]:
from pathlib import Path
from src.utils import load_config

config_path = Path.cwd() / "config.yaml"
config_path.exists()

True

In [38]:
config = load_config(config_path)

## Step 1: Generate customer profiles

In [2]:
from src.data.generator import generate_customer_profiles_table

In [13]:
customer_df = generate_customer_profiles_table(n_customers=5)
customer_df.head()

| | customer_id | loc_long_coord | loc_lat_coord | mean_amount | std_amount | mean_nb_tx_per_day |
|---|---|---|---|---|---|---|
| 0 | 0 | 54.88135 | 71.518937 | 62.262521 | 31.13126 | 2.179533 |
| 1 | 1 | 42.36548 | 64.589411 | 46.570785 | 23.285393 | 3.567092 |
| 2 | 2 | 96.366276 | 38.344152 | 80.213879 | 40.106939 | 2.11558 |
| 3 | 3 | 56.804456 | 92.559664 | 11.748426 | 5.874213 | 0.348517 |
| 4 | 4 | 2.02184 | 83.261985 | 78.924891 | 39.462446 | 3.480049 |


In [16]:
assert len(customer_df) == 5
assert all([col in customer_df.columns for col in ["customer_id", "loc_long_coord", "loc_lat_coord", "mean_amount", "std_amount", "mean_nb_tx_per_day"]])
assert all([long >= 0 and long <= 100 for long in customer_df.loc_long_coord.values])
assert all([lat >= 0 and lat <= 100 for lat in customer_df.loc_lat_coord.values])
assert all([mean >= 5 and mean <= 100 for mean in customer_df.mean_amount.values])
assert all([std >= 2.5 and std <= 50 for std in customer_df.std_amount.values])
assert all([mean >= 0 and mean <= 4 for mean in customer_df.mean_nb_tx_per_day.values])
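The assertions above pin down the value ranges of each property, and the table shows `std_amount` equal to half of `mean_amount`. A minimal sketch of a generator that satisfies these constraints could look as follows. The function name `sketch_customer_profiles` and the use of uniform draws are assumptions for illustration; the actual `generate_customer_profiles_table` in `src.data.generator` may be implemented differently.

```python
import numpy as np
import pandas as pd


def sketch_customer_profiles(n_customers: int, seed: int = 0) -> pd.DataFrame:
    """Hypothetical stand-in for generate_customer_profiles_table."""
    rng = np.random.default_rng(seed)
    # Assumption: every property is drawn uniformly within the asserted bounds
    mean_amount = rng.uniform(5, 100, size=n_customers)
    return pd.DataFrame({
        "customer_id": np.arange(n_customers),
        "loc_long_coord": rng.uniform(0, 100, size=n_customers),
        "loc_lat_coord": rng.uniform(0, 100, size=n_customers),
        "mean_amount": mean_amount,
        # Assumption: std is half the mean, the ratio visible in the table above
        "std_amount": mean_amount / 2,
        "mean_nb_tx_per_day": rng.uniform(0, 4, size=n_customers),
    })


profiles = sketch_customer_profiles(5)
```

Tying `std_amount` to `mean_amount` keeps the spread of amounts proportional to how much a customer typically spends.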

## Step 2: Generate terminal profiles

In [7]:
from src.data.generator import generate_terminal_profiles_table

In [9]:
terminal_df = generate_terminal_profiles_table(n_terminals=5)
terminal_df.head()

| | terminal_id | loc_long_coord | loc_lat_coord |
|---|---|---|---|
| 0 | 0 | 54.88135 | 71.518937 |
| 1 | 1 | 60.276338 | 54.488318 |
| 2 | 2 | 42.36548 | 64.589411 |
| 3 | 3 | 43.758721 | 89.1773 |
| 4 | 4 | 96.366276 | 38.344152 |


In [17]:
assert len(terminal_df) == 5
assert all(col in terminal_df.columns for col in ["terminal_id", "loc_long_coord", "loc_lat_coord"])
assert all([long >= 0 and long <= 100 for long in terminal_df["loc_long_coord"]])
assert all([lat >= 0 and lat <= 100 for lat in terminal_df["loc_lat_coord"]])

## Step 3: Association of customer profiles to terminals

In [18]:
from src.data.generator import get_list_terminals_within_radius

In [23]:
# We first get the geographical locations of all terminals as a numpy array
x_y_terminals = terminal_df[['loc_long_coord','loc_lat_coord']].values.astype(float)
# And get the list of terminals within radius of 50 for the last customer
available_terminals = get_list_terminals_within_radius(customer_df.iloc[4], x_y_terminals=x_y_terminals, r=50)

In [24]:
assert len(available_terminals) == 2
assert all(term in available_terminals for term in [2, 3])

In [25]:
customer_df['available_terminals']=customer_df.apply(lambda x : get_list_terminals_within_radius(x, x_y_terminals=x_y_terminals, r=50), axis=1)
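To make the radius check concrete, here is a minimal sketch that computes Euclidean distances with NumPy and keeps the terminals closer than `r`. The function name `sketch_terminals_within_radius` is hypothetical, and whether the real `get_list_terminals_within_radius` uses a strict or inclusive comparison is an assumption. The coordinates are copied from the two profile tables above, and the returned values are row positions, which coincide with `terminal_id` in this example.

```python
import numpy as np


def sketch_terminals_within_radius(customer_xy, x_y_terminals, r):
    """Hypothetical re-implementation: row indices of terminals at distance < r."""
    diff = np.asarray(x_y_terminals, dtype=float) - np.asarray(customer_xy, dtype=float)
    # Euclidean distance between the customer and every terminal
    dist = np.sqrt((diff ** 2).sum(axis=1))
    return np.where(dist < r)[0].tolist()


# Terminal coordinates from the terminal profile table above
terminals = np.array([
    [54.88135, 71.518937],
    [60.276338, 54.488318],
    [42.36548, 64.589411],
    [43.758721, 89.1773],
    [96.366276, 38.344152],
])
# Location of the last customer (customer_id 4) from the customer profile table
last_customer = (2.02184, 83.261985)

nearby = sketch_terminals_within_radius(last_customer, terminals, r=50)
print(nearby)  # [2, 3]
```

Terminals 2 and 3 lie roughly 44 and 42 units away from this customer, while the other three are more than 50 units away along the longitude axis alone.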

## Step 4: Generation of transactions

In [27]:
from src.data.generator import generate_transactions_table

In [29]:
transaction_df = generate_transactions_table(customer_df.iloc[0],
                                             start_date="2018-04-01",
                                             nb_days=5)
transaction_df.head()

| | tx_datetime | customer_id | terminal_id | tx_amount | tx_time_seconds | tx_time_days |
|---|---|---|---|---|---|---|
| 0 | 2018-04-01 07:19:05 | 0 | 3 | 123.59 | 26345 | 0 |
| 1 | 2018-04-01 19:02:02 | 0 | 3 | 46.51 | 68522 | 0 |
| 2 | 2018-04-01 18:00:16 | 0 | 0 | 77.34 | 64816 | 0 |
| 3 | 2018-04-02 15:13:02 | 0 | 2 | 32.35 | 141182 | 1 |
| 4 | 2018-04-02 14:05:38 | 0 | 3 | 63.3 | 137138 | 1 |


In [32]:
assert all(col in transaction_df.columns for col in ["tx_time_seconds", "tx_time_days", "customer_id", "terminal_id", "tx_amount"])
assert all(amount >= 0 for amount in transaction_df.tx_amount.values)

In [33]:
transactions_df = customer_df.groupby('customer_id').apply(lambda x : generate_transactions_table(x.iloc[0], nb_days=5)).reset_index(drop=True)
transactions_df.head()

| | tx_datetime | customer_id | terminal_id | tx_amount | tx_time_seconds | tx_time_days |
|---|---|---|---|---|---|---|
| 0 | 2018-04-01 07:19:05 | 0 | 3 | 123.59 | 26345 | 0 |
| 1 | 2018-04-01 19:02:02 | 0 | 3 | 46.51 | 68522 | 0 |
| 2 | 2018-04-01 18:00:16 | 0 | 0 | 77.34 | 64816 | 0 |
| 3 | 2018-04-02 15:13:02 | 0 | 2 | 32.35 | 141182 | 1 |
| 4 | 2018-04-02 14:05:38 | 0 | 3 | 63.3 | 137138 | 1 |
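The per-customer generation can be sketched as below. This is only an illustration under stated assumptions: the function name `sketch_transactions_for_customer` is hypothetical, and the distributions (Poisson counts per day, uniform time of day, normally distributed amounts made non-negative with `abs`) are assumptions rather than the documented behaviour of `generate_transactions_table`. What the sketch does take from the output above is the column layout and the fact that `tx_time_seconds` counts seconds since the start date.

```python
import numpy as np
import pandas as pd


def sketch_transactions_for_customer(profile, start_date="2018-04-01", nb_days=5, seed=0):
    """Hypothetical sketch; the real generate_transactions_table may differ."""
    rng = np.random.default_rng(seed)
    start = pd.Timestamp(start_date)
    rows = []
    for day in range(nb_days):
        # Assumption: the daily number of transactions is Poisson distributed
        nb_tx = rng.poisson(profile["mean_nb_tx_per_day"])
        for _ in range(nb_tx):
            # Assumption: the time of day is uniform over the 86,400 seconds
            tx_time_seconds = day * 86400 + int(rng.integers(0, 86400))
            # Assumption: normal amounts; abs() keeps them non-negative
            amount = abs(rng.normal(profile["mean_amount"], profile["std_amount"]))
            rows.append({
                "tx_datetime": start + pd.Timedelta(seconds=tx_time_seconds),
                "customer_id": profile["customer_id"],
                "terminal_id": int(rng.choice(profile["available_terminals"])),
                "tx_amount": round(float(amount), 2),
                "tx_time_seconds": tx_time_seconds,
                "tx_time_days": day,
            })
    columns = ["tx_datetime", "customer_id", "terminal_id",
               "tx_amount", "tx_time_seconds", "tx_time_days"]
    return pd.DataFrame(rows, columns=columns)


# Illustrative profile; the values are rounded from the first customer above
example_profile = {
    "customer_id": 0,
    "mean_amount": 62.26,
    "std_amount": 31.13,
    "mean_nb_tx_per_day": 2.18,
    "available_terminals": [0, 2, 3],
}
example_tx = sketch_transactions_for_customer(example_profile)
```

Applying such a function to each row of the customer table and concatenating the results yields the full transaction table, which is what the `groupby(...).apply(...)` call above does with the real generator.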


## Step 5: Generate a complete dataset

In [34]:
from src.data.generator import generate_dataset

In [35]:
(customer_profiles_table, terminal_profiles_table, transactions_df)=\
    generate_dataset(n_customers=5, 
                     n_terminals=5, 
                     nb_days=5, 
                     start_date="2018-04-01", 
                     r=50)

Time to generate customer profiles table: 0.00038s
Time to generate terminal profiles table: 0.00023s
Time to associate terminals to customers: 0.0027s
Time to generate transactions: 0.011s


In [40]:
gen_config = config["data"]["generator"]

In [41]:
(customer_profiles_table, terminal_profiles_table, transactions_df)=\
    generate_dataset(n_customers=gen_config["num_customers"],
                     n_terminals=gen_config["num_terminals"],
                     nb_days=gen_config["num_days"],
                     start_date=gen_config["start_date"],
                     r=gen_config["customer_radius"])

Time to generate customer profiles table: 0.036s
Time to generate terminal profiles table: 0.027s
Time to associate terminals to customers: 0.78s
Time to generate transactions: 4.1e+01s


## Step 6: Add fraudulent transactions

This last step of the simulation adds fraudulent transactions to the dataset, using the following fraud scenarios:

- **Scenario 1:** Any transaction whose amount is more than 220 is a fraud. This scenario is not inspired by a real-world scenario. Rather, it will provide an obvious fraud pattern that should be detected by any baseline fraud detector. This will be useful to validate the implementation of a fraud detection technique.

- **Scenario 2:** Every day, a list of two terminals is drawn at random. All transactions on these terminals in the next 28 days will be marked as fraudulent. This scenario simulates a criminal use of a terminal, through phishing for example. Detecting this scenario will be possible by adding features that keep track of the number of fraudulent transactions on the terminal. Since the terminal is only compromised for 28 days, additional strategies that involve concept drift will need to be designed to efficiently deal with this scenario.

- **Scenario 3:** Every day, a list of 3 customers is drawn at random. In the next 14 days, 1/3 of their transactions have their amounts multiplied by 5 and are marked as fraudulent. This scenario simulates a card-not-present fraud where the credentials of a customer have been leaked. The customer continues to make transactions, and transactions of higher values are made by the fraudster who tries to maximize their gains. Detecting this scenario will require adding features that keep track of the spending habits of the customer. As with scenario 2, since the card is only temporarily compromised, additional strategies that involve concept drift should also be designed.
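The first two scenarios can be sketched with a few pandas operations. The function `sketch_label_frauds` below is hypothetical and is not the `add_frauds` implementation. It assumes that the 28-day window starts on the day the terminals are drawn and that a transaction already labeled by scenario 1 keeps that label.

```python
import numpy as np
import pandas as pd


def sketch_label_frauds(transactions, terminal_ids, nb_days, seed=0):
    """Hypothetical sketch of scenarios 1 and 2; not the add_frauds implementation."""
    rng = np.random.default_rng(seed)
    df = transactions.copy()
    df["tx_fraud"] = 0
    df["tx_fraud_scenario"] = 0

    # Scenario 1: any amount above 220 is fraudulent
    high_amount = df["tx_amount"] > 220
    df.loc[high_amount, "tx_fraud"] = 1
    df.loc[high_amount, "tx_fraud_scenario"] = 1

    # Scenario 2: each day, two random terminals are compromised for 28 days
    for day in range(nb_days):
        compromised = rng.choice(terminal_ids, size=2, replace=False)
        # Assumption: the window covers the draw day and the 27 days after it
        in_window = (df["tx_time_days"] >= day) & (df["tx_time_days"] < day + 28)
        # Assumption: transactions already labeled keep their first label
        hit = in_window & df["terminal_id"].isin(compromised) & (df["tx_fraud"] == 0)
        df.loc[hit, "tx_fraud"] = 1
        df.loc[hit, "tx_fraud_scenario"] = 2
    return df


toy_tx = pd.DataFrame({
    "terminal_id": [0, 1, 0, 1],
    "tx_amount": [10.0, 250.0, 30.0, 40.0],
    "tx_time_days": [0, 0, 1, 2],
})
# With only two terminals, both are compromised on every draw
labeled = sketch_label_frauds(toy_tx, terminal_ids=[0, 1], nb_days=3)
```

Scenario 3 would follow the same windowing pattern on `customer_id`, with the extra step of sampling one third of the matching transactions and multiplying their amounts by 5.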

In [42]:
from src.data.generator import add_frauds

In [43]:
transactions_df = add_frauds(customer_profiles_table, terminal_profiles_table, transactions_df)

Number of frauds from scenario 1: 978
Number of frauds from scenario 2: 9099
Number of frauds from scenario 3: 4604
CPU times: user 43.4 s, sys: 673 ms, total: 44 s
Wall time: 44.1 s


In [44]:
transactions_df.tx_fraud.mean()

0.008369271814634397

In [45]:
transactions_df.tx_fraud.sum()

14681
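As a quick consistency check, the per-scenario counts printed by `add_frauds` should add up to the total number of frauds, and dividing that total by the fraud rate gives the approximate size of the dataset. The snippet below only reuses the numbers shown in the outputs above.

```python
# Counts and rate copied from the outputs above
frauds_per_scenario = {1: 978, 2: 9099, 3: 4604}
total_frauds = sum(frauds_per_scenario.values())
fraud_rate = 0.008369271814634397

# The fraud rate is total_frauds / n_transactions, so invert it
estimated_n_transactions = round(total_frauds / fraud_rate)

print(total_frauds)  # 14681
print(estimated_n_transactions)
```

The counts sum to 14681, which matches `tx_fraud.sum()`, and the estimate comes out at roughly 1.75 million transactions.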

In [47]:
transactions_df.head()

| | transaction_id | tx_datetime | customer_id | terminal_id | tx_amount | tx_time_seconds | tx_time_days | tx_fraud | tx_fraud_scenario |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2018-04-01 00:00:31 | 596 | 3156 | 57.16 | 31 | 0 | 0 | 0 |
| 1 | 1 | 2018-04-01 00:02:10 | 4961 | 3412 | 81.51 | 130 | 0 | 0 | 0 |
| 2 | 2 | 2018-04-01 00:07:56 | 2 | 1365 | 146.0 | 476 | 0 | 0 | 0 |
| 3 | 3 | 2018-04-01 00:09:29 | 4128 | 8737 | 64.49 | 569 | 0 | 0 | 0 |
| 4 | 4 | 2018-04-01 00:10:34 | 927 | 9906 | 50.99 | 634 | 0 | 0 | 0 |


The resulting dataset has several properties that make it a useful test bed: class imbalance (about 0.84% of transactions are fraudulent), a mix of numerical and categorical features, non-trivial relationships between features, and time-dependent fraud scenarios.

## Step 7: Store data

In [49]:
DIR_OUTPUT = Path.cwd().parent / "data/raw/"
DIR_OUTPUT.mkdir(parents=True, exist_ok=True)

In [52]:
str(gen_config["start_date"])

'2023-02-01'

In [54]:
import datetime

start_date = datetime.datetime.strptime(str(gen_config["start_date"]), "%Y-%m-%d")

for day in range(transactions_df.tx_time_days.max()+1):
    
    transactions_day = transactions_df[transactions_df.tx_time_days==day].sort_values('tx_time_seconds')
    
    date = start_date + datetime.timedelta(days=day)
    filename_output = date.strftime("%Y-%m-%d")+'.pkl'
    
    transactions_day.to_pickle(DIR_OUTPUT / filename_output)
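Since each day is stored as `YYYY-MM-DD.pkl`, a date range can be loaded back by filtering on the file names. The helper `sketch_read_daily_files` below is a hypothetical sketch and not part of `src`. It relies on ISO dates sorting chronologically as strings, and it writes a few tiny files to a temporary directory only to demonstrate the round trip. Note that `pd.read_pickle` should only be used on files from a trusted source.

```python
import tempfile
from pathlib import Path

import pandas as pd


def sketch_read_daily_files(dir_input, begin_date, end_date):
    """Hypothetical helper: load the daily pickle files in [begin_date, end_date]."""
    dir_input = Path(dir_input)
    # File names are ISO dates, so string comparison matches chronological order
    files = sorted(
        f for f in dir_input.glob("*.pkl") if begin_date <= f.stem <= end_date
    )
    frames = [pd.read_pickle(f) for f in files]
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)


# Round trip on three tiny daily files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    for i, name in enumerate(["2023-02-01", "2023-02-02", "2023-02-03"]):
        day_df = pd.DataFrame({"transaction_id": [i], "tx_time_days": [i]})
        day_df.to_pickle(Path(tmp) / f"{name}.pkl")
    loaded = sketch_read_daily_files(tmp, "2023-02-01", "2023-02-02")
```

Storing one file per day keeps each file small and makes it cheap to load only the training or test period that an experiment needs.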