# Data generation through simulation of credit card transactions

## Data generation process

The simulation consists of five main steps, following this [book chapter](https://fraud-detection-handbook.github.io/fraud-detection-handbook/Chapter_3_GettingStarted/SimulatedDataset.html):

1. **Generation of customer profiles:** Every customer is different in their spending habits. This will be simulated by defining some properties for each customer. The main properties will be their geographical location, their spending frequency, and their spending amounts. The customer properties will be represented as a table, referred to as the customer profile table.

2. **Generation of terminal profiles:** Terminal properties will simply consist of a geographical location. The terminal properties will be represented as a table, referred to as the terminal profile table.

3. **Association of customer profiles to terminals:** We will assume that customers only make transactions on terminals within a radius *r* of their geographical location, that is, on terminals that are geographically close to them. This step adds a feature `available_terminals` to each customer profile, containing the set of terminals that the customer can use.

4. **Generation of transactions:** The simulator will loop over the set of customer profiles, and generate transactions according to their properties (spending frequencies and amounts, and available terminals). This will result in a table of transactions.

5. **Generation of fraud scenarios:** This last step will label the transactions as legitimate or fraudulent. This will be done by following three different fraud scenarios.

## Setup

In [1]:
%load_ext autoreload
%autoreload 2

In [3]:
from pathlib import Path
from src.utils import load_config

config_path = Path.cwd() / "config.yaml"
config_path.exists()

True

In [4]:
config = load_config(config_path)

## Step 1: Generate customer profiles

In [2]:
from src.data.generator import generate_customer_profiles_table

In [13]:
customer_df = generate_customer_profiles_table(n_customers=5)
customer_df.head()

| | customer_id | loc_long_coord | loc_lat_coord | mean_amount | std_amount | mean_nb_tx_per_day |
|---|---|---|---|---|---|---|
| 0 | 0 | 54.88135 | 71.518937 | 62.262521 | 31.13126 | 2.179533 |
| 1 | 1 | 42.36548 | 64.589411 | 46.570785 | 23.285393 | 3.567092 |
| 2 | 2 | 96.366276 | 38.344152 | 80.213879 | 40.106939 | 2.11558 |
| 3 | 3 | 56.804456 | 92.559664 | 11.748426 | 5.874213 | 0.348517 |
| 4 | 4 | 2.02184 | 83.261985 | 78.924891 | 39.462446 | 3.480049 |


In [16]:
assert len(customer_df) == 5
assert all(col in customer_df.columns for col in ["customer_id", "loc_long_coord", "loc_lat_coord", "mean_amount", "std_amount", "mean_nb_tx_per_day"])
assert all(0 <= long <= 100 for long in customer_df.loc_long_coord.values)
assert all(0 <= lat <= 100 for lat in customer_df.loc_lat_coord.values)
assert all(5 <= mean <= 100 for mean in customer_df.mean_amount.values)
assert all(2.5 <= std <= 50 for std in customer_df.std_amount.values)
assert all(0 <= mean <= 4 for mean in customer_df.mean_nb_tx_per_day.values)
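The ranges checked above suggest how such a profile table could be produced. The following is only a minimal sketch, not the code in `src.data.generator`: it assumes uniform sampling within the asserted bounds and sets `std_amount` to half of `mean_amount`, which matches the values in the output above. The function name `sketch_customer_profiles` and the `seed` parameter are hypothetical.

```python
import numpy as np
import pandas as pd


def sketch_customer_profiles(n_customers, seed=0):
    """Hypothetical stand-in for generate_customer_profiles_table."""
    rng = np.random.default_rng(seed)
    # Assumption: mean spending amount is uniform in [5, 100].
    mean_amount = rng.uniform(5, 100, size=n_customers)
    return pd.DataFrame({
        "customer_id": np.arange(n_customers),
        # Assumption: customers live on a 100 x 100 grid.
        "loc_long_coord": rng.uniform(0, 100, size=n_customers),
        "loc_lat_coord": rng.uniform(0, 100, size=n_customers),
        "mean_amount": mean_amount,
        # Assumption: standard deviation is half of the mean amount.
        "std_amount": mean_amount / 2,
        # Assumption: between 0 and 4 transactions per day on average.
        "mean_nb_tx_per_day": rng.uniform(0, 4, size=n_customers),
    })


sketch_df = sketch_customer_profiles(5)
```

A terminal profile table could be sketched the same way by keeping only an identifier column and the two coordinate columns.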

## Step 2: Generate terminal profiles

In [7]:
from src.data.generator import generate_terminal_profiles_table

In [9]:
terminal_df = generate_terminal_profiles_table(n_terminals=5)
terminal_df.head()

| | terminal_id | loc_long_coord | loc_lat_coord |
|---|---|---|---|
| 0 | 0 | 54.88135 | 71.518937 |
| 1 | 1 | 60.276338 | 54.488318 |
| 2 | 2 | 42.36548 | 64.589411 |
| 3 | 3 | 43.758721 | 89.1773 |
| 4 | 4 | 96.366276 | 38.344152 |


In [17]:
assert len(terminal_df) == 5
assert all(col in terminal_df.columns for col in ["terminal_id", "loc_long_coord", "loc_lat_coord"])
assert all(0 <= long <= 100 for long in terminal_df["loc_long_coord"])
assert all(0 <= lat <= 100 for lat in terminal_df["loc_lat_coord"])

## Step 3: Association of customer profiles to terminals

In [18]:
from src.data.generator import get_list_terminals_within_radius

In [23]:
# We first get the geographical locations of all terminals as a numpy array
x_y_terminals = terminal_df[['loc_long_coord','loc_lat_coord']].values.astype(float)
# Then get the list of terminals within a radius of 50 of the last customer
available_terminals = get_list_terminals_within_radius(customer_df.iloc[4], x_y_terminals=x_y_terminals, r=50)

In [24]:
assert len(available_terminals) == 2
assert all(term in available_terminals for term in [2, 3])

In [25]:
customer_df['available_terminals'] = customer_df.apply(lambda x: get_list_terminals_within_radius(x, x_y_terminals=x_y_terminals, r=50), axis=1)
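To make the radius check concrete, here is a minimal sketch of how it could be computed with NumPy. It is not the implementation of `get_list_terminals_within_radius`: the function name `terminals_within_radius`, the use of Euclidean distance, and the strict `<` comparison are assumptions. The coordinates are copied from the outputs shown earlier.

```python
import numpy as np


def terminals_within_radius(customer_xy, terminals_xy, r):
    """Return the indices of terminals closer than r to the customer."""
    customer_xy = np.asarray(customer_xy, dtype=float)
    terminals_xy = np.asarray(terminals_xy, dtype=float)
    # Euclidean distance from the customer to each terminal (one per row).
    dist = np.sqrt(((terminals_xy - customer_xy) ** 2).sum(axis=1))
    # Assumption: a terminal exactly at distance r is excluded.
    return np.flatnonzero(dist < r).tolist()


# Terminal coordinates (terminal_id 0 to 4) from the terminal profile table.
terminals_xy = [
    [54.88135, 71.518937],
    [60.276338, 54.488318],
    [42.36548, 64.589411],
    [43.758721, 89.1773],
    [96.366276, 38.344152],
]
# Location of the last customer (customer_id 4).
customer_4_xy = [2.02184, 83.261985]

nearby = terminals_within_radius(customer_4_xy, terminals_xy, r=50)
print(nearby)  # [2, 3]
```

Terminals 2 and 3 lie roughly 44 and 42 units away from this customer, while the other three are more than 50 units away, which agrees with the assertion on `available_terminals` above.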

## Step 4: Generation of transactions

In [27]:
from src.data.generator import generate_transactions_table

In [29]:
transaction_df = generate_transactions_table(customer_df.iloc[0],
                                             start_date="2018-04-01",
                                             nb_days=5)
transaction_df.head()

| | tx_datetime | customer_id | terminal_id | tx_amount | tx_time_seconds | tx_time_days |
|---|---|---|---|---|---|---|
| 0 | 2018-04-01 07:19:05 | 0 | 3 | 123.59 | 26345 | 0 |
| 1 | 2018-04-01 19:02:02 | 0 | 3 | 46.51 | 68522 | 0 |
| 2 | 2018-04-01 18:00:16 | 0 | 0 | 77.34 | 64816 | 0 |
| 3 | 2018-04-02 15:13:02 | 0 | 2 | 32.35 | 141182 | 1 |
| 4 | 2018-04-02 14:05:38 | 0 | 3 | 63.3 | 137138 | 1 |


In [32]:
assert all(col in transaction_df.columns for col in ["tx_time_seconds", "tx_time_days", "customer_id", "terminal_id", "tx_amount"])
assert all(amount >= 0 for amount in transaction_df.tx_amount.values)
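For readers without access to `src.data.generator`, the per-customer generation could look roughly like the sketch below. Everything in it is an assumption rather than the actual implementation: the name `sketch_transactions`, the Poisson draw for the daily transaction count, the uniform time of day, the normal draw for amounts with a uniform redraw when the value is negative, and the `seed` parameter. Only the output columns and the relationship between `tx_time_seconds` (seconds since `start_date`) and `tx_time_days` are taken from the table above.

```python
import numpy as np
import pandas as pd

SECONDS_PER_DAY = 86400


def sketch_transactions(customer, start_date="2018-04-01", nb_days=5, seed=0):
    """Hypothetical stand-in for generate_transactions_table."""
    rng = np.random.default_rng(seed)
    start = pd.Timestamp(start_date)
    terminals = list(customer["available_terminals"])
    rows = []
    for day in range(nb_days):
        # Assumption: daily transaction count is Poisson distributed.
        # A customer with no reachable terminal makes no transactions.
        nb_tx = rng.poisson(customer["mean_nb_tx_per_day"]) if terminals else 0
        for _ in range(nb_tx):
            # Assumption: time of day is uniform over the 24 hours.
            seconds_in_day = int(rng.integers(0, SECONDS_PER_DAY))
            amount = rng.normal(customer["mean_amount"], customer["std_amount"])
            if amount < 0:
                # Assumption: negative draws are replaced by a uniform draw.
                amount = rng.uniform(0, 2 * customer["mean_amount"])
            tx_time_seconds = day * SECONDS_PER_DAY + seconds_in_day
            rows.append({
                "tx_datetime": start + pd.Timedelta(seconds=tx_time_seconds),
                "customer_id": customer["customer_id"],
                "terminal_id": int(rng.choice(terminals)),
                "tx_amount": round(float(amount), 2),
                "tx_time_seconds": tx_time_seconds,
                "tx_time_days": day,
            })
    columns = ["tx_datetime", "customer_id", "terminal_id",
               "tx_amount", "tx_time_seconds", "tx_time_days"]
    return pd.DataFrame(rows, columns=columns)


# Profile of customer 0 from Step 1; the terminal list follows from the
# coordinates shown earlier with r=50.
customer_0 = {
    "customer_id": 0,
    "mean_amount": 62.262521,
    "std_amount": 31.13126,
    "mean_nb_tx_per_day": 2.179533,
    "available_terminals": [0, 1, 2, 3],
}
sketch_tx_df = sketch_transactions(customer_0, nb_days=5)
```

Because the sketch draws from its own random generator, its rows will not match the values produced by `generate_transactions_table`.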

In [33]:
transactions_df = (
    customer_df.groupby('customer_id')
    .apply(lambda x: generate_transactions_table(x.iloc[0], nb_days=5))
    .reset_index(drop=True)
)
transactions_df.head()

| | tx_datetime | customer_id | terminal_id | tx_amount | tx_time_seconds | tx_time_days |
|---|---|---|---|---|---|---|
| 0 | 2018-04-01 07:19:05 | 0 | 3 | 123.59 | 26345 | 0 |
| 1 | 2018-04-01 19:02:02 | 0 | 3 | 46.51 | 68522 | 0 |
| 2 | 2018-04-01 18:00:16 | 0 | 0 | 77.34 | 64816 | 0 |
| 3 | 2018-04-02 15:13:02 | 0 | 2 | 32.35 | 141182 | 1 |
| 4 | 2018-04-02 14:05:38 | 0 | 3 | 63.3 | 137138 | 1 |
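The rows shown above are not in chronological order within a day (19:02:02 appears before 18:00:16). If a time-ordered table is needed downstream, one possible approach is to sort by `tx_datetime` and reset the index. The sketch below rebuilds the five displayed rows so that it runs on its own; the `transaction_id` column is a hypothetical addition, not something the notebook creates.

```python
import pandas as pd

# The five rows displayed above, rebuilt by hand for a self-contained example.
sample_df = pd.DataFrame({
    "tx_datetime": pd.to_datetime([
        "2018-04-01 07:19:05", "2018-04-01 19:02:02", "2018-04-01 18:00:16",
        "2018-04-02 15:13:02", "2018-04-02 14:05:38",
    ]),
    "customer_id": [0, 0, 0, 0, 0],
    "terminal_id": [3, 3, 0, 2, 3],
    "tx_amount": [123.59, 46.51, 77.34, 32.35, 63.3],
    "tx_time_seconds": [26345, 68522, 64816, 141182, 137138],
    "tx_time_days": [0, 0, 0, 1, 1],
})

# Sort chronologically and discard the old row labels.
sorted_df = sample_df.sort_values("tx_datetime").reset_index(drop=True)
# Hypothetical identifier: position of the transaction in time order.
sorted_df.insert(0, "transaction_id", list(range(len(sorted_df))))

print(sorted_df["tx_time_seconds"].tolist())
# [26345, 64816, 68522, 137138, 141182]
```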
