# 0. Importing Necessary Libraries

In [1]:
import numpy as np
import pandas as pd
import random
from pandarallel import pandarallel # for parallel computations with pandas
import time

# 1. Transaction Data Simulator

For confidentiality reasons, it is hard to find data related to fraud detection problems, despite its importance. Therefore, I will use a transaction data simulator to generate data for this project. Please refer to the [recent book about ML for fraud detection](https://fraud-detection-handbook.github.io/fraud-detection-handbook/Foreword.html) that introduces the simulator for more details. In the following, I will briefly explain the main ideas behind the transactions data generating process.

The following table describes the features that will be present in each transaction.

| Transaction Feature | Description |
|:-------------------:|:-----------:|
|TRANSACTION_ID | Unique identifier for the transaction.|
|TX_DATETIME | Date and time of the transaction. |
|CUSTOMER_ID | Unique identifier of the customer who made the transaction. |
|TERMINAL_ID | Unique identifier of the terminal from which the transaction was made. |
|TX_AMOUNT | The amount of the transaction. |
|TX_FRAUD | Indicator variable for frauds (0 - Legitimate; 1 - Fraud).|



The simulation process will have five main steps:

1. Generation of customers profiles: each customer has different spending habits based on individual properties.

2. Generation of terminal profiles: defines the geographical location of the terminal.

3. Association of customer profiles to terminals: assumes that customers can only make transactions in terminals within a radius *r* of their geographical locations. Adds a feature 'list_terminals' to each customer profile, containing a set of terminals that the customer can use.

4. Generation of transactions: generates transactions by the customer profiles.

5. Generation of fraud scenarios: uses three different approaches to label some transactions as fraudulent.

## 1.1. Customer profiles generation

The customers will be modelled by the following attributes:

- CUSTOMER_ID: unique identifier of the customer.

- (x_customer, y_customer): pair of real coordinates in a 100 x 100 grid that defines the geographical location of the customer.

- (mean_amount, std_amount): mean and standard deviation of the transaction amounts for the customer, assuming that the transaction amounts follow a normal distribution. The mean_amount will be drawn from a uniform distribution (5,100) and the std_amount will be set as the mean_amount divided by two.

- mean_nb_tx_per_day: The average number of transactions per day for the customer, assuming that the number of transactions per day follows a Poisson distribution. This number will be drawn from a uniform distribution (0,4).

In [2]:
def generate_customer_profiles_table(n_customers, random_state=42):
    """
    Generates a table of customer profiles.
    
    Each customer profile consists of the folowing properties:
    * CUSTOMER_ID: unique identifier of the customer.
    * (x_customer, y_customer): pair of real coordinates in a 100 x 100 grid that defines the geographical location
    of the customer.
    * (mean_amount, std_amount): mean and standard deviation of the transaction amounts for the customer, assuming
    that the transaction amounts follow a normal distribution. The mean_amount will be drawn from a uniform
    distribution (5,100) and the std_amount will be set as the mean_amount divided by two.
    * mean_nb_tx_per_day: The average number of transactions per day for the customer, assuming that the number of
    transactions per day follows a Poisson distribution. This number will be drawn from a uniform distribution (0,4).
    
    Parameters
    ----------
    n_customers: int
        Number of customers for which to generate a profile
    random_state: int
        Random state for reproducibility of the results.
    
    Returns
    -------
    customer_profiles_table: pandas.DataFrame
        Table containing the properties of each customer.
        
    """
    # Setting up
    np.random.seed(random_state)
    MAX_COORDINATE = 100
    MIN_MEAN_AMOUNT = 5
    MAX_MEAN_AMOUNT = 100
    MAX_MEAN_NB_TX_PER_DAY = 4
    
    # Generate customer profiles
    customers_properties = []
    for customer_id in range(n_customers):
        x_customer = np.random.uniform(0, MAX_COORDINATE)
        y_customer = np.random.uniform(0, MAX_COORDINATE)
        mean_amount = np.random.uniform(MIN_MEAN_AMOUNT, MAX_MEAN_AMOUNT)
        std_amount = mean_amount / 2
        mean_nb_tx_per_day = np.random.uniform(0, MAX_MEAN_NB_TX_PER_DAY)
        customers_properties.append([
            customer_id,
            x_customer, y_customer,
            mean_amount, std_amount,
            mean_nb_tx_per_day
        ])
    customer_profiles_table = pd.DataFrame(customers_properties, columns=['CUSTOMER_ID',
                                                                         'x_customer', 'y_customer',
                                                                         'mean_amount', 'std_amount',
                                                                         'mean_nb_tx_per_day'])
    return customer_profiles_table 

## 1.2. Terminal profiles generation

The terminals will be modelled by the following attributes:

- TERMINAL_ID: unique identifier of the terminal.

- (x_terminal, y_terminal): pair of real coordinates in a 100 x 100 grid that defines the geographical location of the temrinal.

In [3]:
def generate_terminal_profiles_table(n_terminals, random_state=42):
    """
    Generates a table of terminal profiles.
    
    Each terminal profile consists of the folowing properties:
    * TERMINAL_ID: unique identifier of the terminal.
    * (x_temrinal, y_temrinal): pair of real coordinates in a 100 x 100 grid that defines the geographical location
    of the terminal.
    
    Parameters
    ----------
    n_terminals: int
        Number of terminals for which to generate a profile
    random_state: int
        Random state for reproducibility of the results.
    
    Returns
    -------
    terminal_profiles_table: pandas.DataFrame
        Table containing the properties of each terminal.
        
    """
    # Setting up
    np.random.seed(random_state)
    MAX_COORDINATE = 100

    # Generate terminal profiles
    terminals_properties = []
    for terminal_id in range(n_terminals):
        x_terminal = np.random.uniform(0, MAX_COORDINATE)
        y_terminal = np.random.uniform(0, MAX_COORDINATE)
        terminals_properties.append([
            terminal_id,
            x_terminal, y_terminal,
        ])
    terminal_profiles_table = pd.DataFrame(terminals_properties, columns=['TERMINAL_ID',
                                                                         'x_customer', 'y_customer',
                                                                         ])
    return terminal_profiles_table 

## 1.3. Association of customer profiles to terminals

In [4]:
def get_list_terminals_within_radius(customer_profile, x_y_terminals, r):
    """
    Generates the list of all terminals within a radius 'r' of the customer.
    
    The simulation assumes that customers can only make transactions in terminals within a radius r of their
    geographical locations, which adds a feature 'list_terminals' to each customer profile, containing a set of
    terminals that the customer can use.
    
    Parameters
    ----------
    customer_profile: pandas.Series
        The profile of a customer (a row from customer_profiles_table).
    x_y_terminals: numpy.ndarray
        An array that contains the geographical locations of all terminals.
    r: float
        Radius around the customer location to determine terminal availability.
    
    Returns
    -------
    list_terminals: list
        List of the IDs of the terminals available to the customer.
        
    """
    # Location of the customer
    x_y_customer = customer_profile[['x_customer', 'y_customer']].values.astype(float)
    
    # Squared difference in coordinates between customer and terminal locations
    squared_diff_x_y = np.square(x_y_customer - x_y_terminals)
    
    # Sum along rows and compute sqrt to get distance
    dist_x_y = np.sqrt(np.sum(squared_diff_x_y, axis=1))
    
    # Get the indices of terminals which are at a distance less than r
    list_terminals = list(np.where(dist_x_y < r)[0])

    return list_terminals

## 1.4. Generation of transactions

The customers' properties regulate transaction features such as frequency and typical amounts. Besides this, the simulations will try to reflect that most transactions occur during the day. Therefore, the transaction times, in seconds along a day, will follow a normal distribution centered around noon with an std of 20000 seconds. 

In [5]:
def generate_transactions_table(customer_profile, start_date='2022-01-01', nb_days=200):
    """
    Generates a table of unlabeled transactions by a customer profile.
    
    The customer's properties regulate transaction features such as frequency and typical amounts. The transaction
    times, in seconds along a day, will follow a normal distribution centered around noon with an std of 20000 
    seconds. 
    
    Parameters
    ----------
    customer_profile: pandas.Series
        The profile of a customer (a row from customer_profiles_table).
    start_date: str
        The date of the first transaction in the format "YYYY-MM-DD".
    nb_days: int
        The number of days to generate transactions.
        
    Returns
    -------
    customer_transactions: pandas.DataFrame
        Table with transactions made by the customer over the specified period.

    """
    # Setting up
    customer_transactions = []
    random.seed(int(customer_profile.CUSTOMER_ID))
    np.random.seed(int(customer_profile.CUSTOMER_ID))
    TOTAL_SECONDS_DAY = 86400
    MEAN_TX_TIME = TOTAL_SECONDS_DAY / 2 # corresponding to noon
    STD_TX_TIME = 20000
    
    
    for day in range(nb_days):
        # number of transactions for this day
        nb_tx = np.random.poisson(customer_profile.mean_nb_tx_per_day)
        
        # if customer did some transactions, generate them
        if nb_tx > 0: 
            for tx in range(nb_tx):
                # time of the transaction, in seconds of that day.
                tx_time = int(np.random(MEAN_TX_TIME, STD_TX_TIME))
                # if transacton time is in the (0, TOTAL_SECONDS_DAY) interval, accept it
                if 0 < tx_time and tx_time < TOTAL_SECONDS_DAY:
                    amount = np.random.normal(customer_profile.mean_amount, customer_profile.std_amount)
                    # if amount is negative, draw from a uniform disribution
                    if amount < 0:
                        amount = np.random.uniform(0, customer_profile.mean_amount * 2)
                    else:
                        pass
                    amount = np.round(amount, decimals=2)
                    # verify terminal availability
                    if len(customer_profile.list_terminals) > 0:
                        terminal_id = random.choice(customer_profile.list_terminals)
                        customer_transactions.append([
                            time_tx + day * TOTAL_SECONDS_DAY, day,
                            customer_profile.CUSTOMER_ID,
                            terminal_id, amount
                        ])
                    else:
                        continue
                else: 
                    continue
        else:
            continue
    customer_transactions = pd.DataFrame(customer_transactions, columns=[
        'TX_DATETIME', 'CUSTOMER_ID', 'TERMINAL_ID', 'TX_AMOUNT', 'TX_TIME_SECONDS', 'TX_TIME_DAYS'
    ])
    
    if len(customer_transactions) > 0:
        customer_transactions['TX_DATETIME'] = pd.to_datetime(customer_transactions['TX_TIME_SECONDS'], 
                                                             unit='s',
                                                             origin=start_date)
    else:
        pass
    
    return customer_transactions

At this point, it is possible to generate a dataset of unlabeled transactions.

In [6]:
def generate_dataset(n_customers=5000, n_terminals=10000, nb_days=200, start_date='2022-01-01', r=5):
    '''
    Generates an unlabeled dataset of transactions and displays relevant timing information.
    
    Each unlabeled transaction has the following features:
    * TRANSACTION_ID: unique identifier for the transaction.
    * TX_DATETIME: date and time of the transaction.
    * CUSTOMER_ID: the unique identifier of the customer who made the transaction. 
    * TERMINAL_ID: the unique identifier of the terminal associated with the transaction. 
    * TX_AMOUNT: the amount of the transaction.
    * TX_SECONDS: the number of elapsed seconds since the date of the first recorded transaction.
    * TX_DAYS: the number of elapsed days since the date of the first recorded transaction.

    
    Parameters
    ----------
    n_customers: int
        The number of customers of the institution.
    n_terminals: int
        The number of terminals with support of the institution.
    nb_days: int
        The number of days to generate transactions.
    start_date: str
        The date of the first transaction in the format "YYYY-MM-DD".
    r: float
        The radius (distance) 'r' determines the terminals available to each customer.
        
    Returns
    -------
    customer_profiles_table: pandas.DataFrame
        Table with the properties of each customer.
    terminal_profiles_table: pandas.DataFrame
        Table with the properties of each terminal.
    transactions_df: pandas.DataFrame
        Table with all transactions of the simulation.
    
    '''
    # set up pandarallel to enable parallel_apply pandas methods
    pandarallel.initialize(use_memory_fs=False)

    # Generate customer profiles
    start_time = time.time()
    customer_profiles_table = generate_customer_profiles_table(n_customers)
    print(f'Time to generate customer profiles table: {time.time() - start_time:.2f}s.')
    
    # Generate terminal profiles
    start_time = time.time()
    terminal_profiles_table = generate_terminal_profiles_table(n_terminals)
    print(f'Time to generate terminal profiles table: {time.time() - start_time:.2f}s.')
    
    # Associate terminals to customers
    start_time = time.time()
    x_y_terminals = terminal_profiles_table[['x_terminal', 'y_terminal']].values.astype(float)
    customer_profiles_table['list_terminals'] = customer_profiles_table.parallel_apply(
        lambda x: get_list_terminals_within_radius(x, x_y_temrinals, r), axis=1
    )
    print(f'Time to associate terminals to customers: {time.time() - start_time:.2f}s.')
    
    
    # Generate transactions
    start_time = time.time()
    transactions_df = customer_profiles_table.groupby('CUSTOMER_ID').parallel_apply(
        lambda x: generate_transactions_table(x.iloc[0], start_date, nb_days)
    ).reset_index(drop=True)
    print(f'Time to generate transactions: {time.time() - start_time:.2f}s.')
    
    # Sort transactions chronologically
    transactions_df = transactions_df.sort_values('TX_DATETIME')
    
    # Reset indexes
    transactions_df.reset_index(inplace=True, drop=True)
    
    # TRANSACTION_ID are the dataframe indices, startng from 0
    transactions_df.reset_index(inplace=True)
    transactions_df.rename(columns={'index':'TRANSACTION_ID'}, inplace=True)
    
    return (customer_profiles_table, terminal_profiles_table, transactions_df)

In [7]:
customer_profiles_table, terminal_profiles_table, transactions_df = generate_dataset()

INFO: Pandarallel will run on 2 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Time to generate customer profiles table: 0.47s.


NameError: name 'x_customer' is not defined