# 0. Importing Necessary Libraries

In [None]:
import numpy as np
import pandas as pd

# 1. Transaction Data Simulator

Because of confidentiality constraints, real-world data for fraud detection problems is hard to find, despite the importance of the problem. I will therefore use a transaction data simulator to generate the data for this project. The simulator is introduced in the [recent book about ML for fraud detection](https://fraud-detection-handbook.github.io/fraud-detection-handbook/Foreword.html); please refer to it for more details. Below, I briefly explain the main ideas behind the transaction data generating process.

The following table describes the features that will be present in each transaction.

| Transaction Feature | Description |
|:-------------------:|:-----------:|
|TRANSACTION_ID | Unique identifier for the transaction.|
|TX_DATETIME | Date and time of the transaction. |
|CUSTOMER_ID | Unique identifier of the customer who made the transaction. |
|TERMINAL_ID | Unique identifier of the terminal from which the transaction was made. |
|TX_AMOUNT | The amount of the transaction. |
|TX_FRAUD | Indicator variable for frauds (0 - Legitimate; 1 - Fraud).|
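
To make the schema concrete, here is a minimal sketch of a transactions table with these six columns. The two rows are hypothetical values invented purely for illustration; they are not output of the simulator.

```python
import pandas as pd

# Hypothetical example rows: the values are made up only to illustrate the schema.
transactions_example = pd.DataFrame({
    "TRANSACTION_ID": [0, 1],
    "TX_DATETIME": pd.to_datetime(["2023-01-01 08:15:00", "2023-01-01 09:42:30"]),
    "CUSTOMER_ID": [12, 7],
    "TERMINAL_ID": [3, 3],
    "TX_AMOUNT": [25.40, 102.99],
    "TX_FRAUD": [0, 1],  # 0 - Legitimate; 1 - Fraud
})

print(transactions_example.dtypes)
```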



The simulation process will have five main steps:

1. Generation of customer profiles: each customer has different spending habits based on individual properties.

2. Generation of terminal profiles: defines the geographical location of the terminal.

3. Association of customer profiles to terminals: assumes that customers can only make transactions in terminals within a radius *r* of their geographical locations. Adds a feature 'list_terminals' to each customer profile, containing a set of terminals that the customer can use.

4. Generation of transactions: generates transactions from the customer profiles.

5. Generation of fraud scenarios: uses three different approaches to label some transactions as fraudulent.

## 1.1. Customer profiles generation

The customers will be modelled by the following attributes:

- CUSTOMER_ID: unique identifier of the customer.

- (x_customer, y_customer): pair of real coordinates in a 100 x 100 grid that defines the geographical location of the customer.

- (mean_amount, std_amount): mean and standard deviation of the transaction amounts for the customer, assuming that the transaction amounts follow a normal distribution. The mean_amount will be drawn from a uniform distribution (5,100) and the std_amount will be set as the mean_amount divided by two.

- mean_nb_tx_per_day: The average number of transactions per day for the customer, assuming that the number of transactions per day follows a Poisson distribution. This number will be drawn from a uniform distribution (0,4).
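
As a rough illustration of how these attributes are meant to drive the simulation, the sketch below draws one hypothetical customer's parameters and then samples a single day of activity: a Poisson count of transactions and normally distributed amounts. The seed and the use of `np.random.default_rng` are assumptions made for this example only; the actual transaction generator is described in section 1.4.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumption: seed chosen only for reproducibility

# Hypothetical customer, drawn according to the attribute definitions above
mean_amount = rng.uniform(5, 100)
std_amount = mean_amount / 2
mean_nb_tx_per_day = rng.uniform(0, 4)

# Number of transactions on one day ~ Poisson(mean_nb_tx_per_day)
nb_tx = rng.poisson(mean_nb_tx_per_day)

# Transaction amounts ~ Normal(mean_amount, std_amount)
amounts = rng.normal(mean_amount, std_amount, size=nb_tx)

print(nb_tx, amounts)
```

Note that with a standard deviation equal to half the mean, a normal draw can occasionally be negative. This sketch does not handle that case.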

In [1]:
def generate_customer_profiles_table(n_customers, random_state=42):
    """
    Generates a table of customer profiles.
    
    Each customer profile consists of the following properties:
    * CUSTOMER_ID: unique identifier of the customer.
    * (x_customer, y_customer): pair of real coordinates in a 100 x 100 grid that defines the geographical location
    of the customer.
    * (mean_amount, std_amount): mean and standard deviation of the transaction amounts for the customer, assuming
    that the transaction amounts follow a normal distribution. The mean_amount will be drawn from a uniform
    distribution (5,100) and the std_amount will be set as the mean_amount divided by two.
    * mean_nb_tx_per_day: The average number of transactions per day for the customer, assuming that the number of
    transactions per day follows a Poisson distribution. This number will be drawn from a uniform distribution (0,4).
    
    Parameters
    ----------
    n_customers: int
        Number of customers for which to generate a profile
    random_state: int
        Random state for reproducibility of the results.
    
    Returns
    -------
    customer_profiles_table: pandas.DataFrame
        Table containing the properties of each customer.
        
    """
    # Set random state
    np.random.seed(random_state)
    
    # Generate customer profiles
    customers_properties = []
    for customer_id in range(n_customers):
        x_customer = np.random.uniform(0, 100)
        y_customer = np.random.uniform(0, 100)
        mean_amount = np.random.uniform(5, 100)
        std_amount = mean_amount / 2
        mean_nb_tx_per_day = np.random.uniform(0, 4)
        customers_properties.append([
            customer_id,
            x_customer, y_customer,
            mean_amount, std_amount,
            mean_nb_tx_per_day
        ])
    customer_profiles_table = pd.DataFrame(customers_properties, columns=['CUSTOMER_ID',
                                                                         'x_customer', 'y_customer',
                                                                         'mean_amount', 'std_amount',
                                                                         'mean_nb_tx_per_day'])
    return customer_profiles_table 
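
For larger numbers of customers, the same profiles could also be drawn in a vectorized way. The following is a minimal sketch under stated assumptions: the function name `generate_customer_profiles_vectorized` is hypothetical, and it uses `np.random.default_rng`, so the generated values will not match those of the loop-based function above even with the same `random_state`.

```python
import numpy as np
import pandas as pd


def generate_customer_profiles_vectorized(n_customers, random_state=42):
    """Hypothetical vectorized variant: draws every column in one call."""
    rng = np.random.default_rng(random_state)
    mean_amount = rng.uniform(5, 100, size=n_customers)
    return pd.DataFrame({
        "CUSTOMER_ID": np.arange(n_customers),
        "x_customer": rng.uniform(0, 100, size=n_customers),
        "y_customer": rng.uniform(0, 100, size=n_customers),
        "mean_amount": mean_amount,
        "std_amount": mean_amount / 2,  # std is half the mean, as defined above
        "mean_nb_tx_per_day": rng.uniform(0, 4, size=n_customers),
    })


profiles = generate_customer_profiles_vectorized(5)
print(profiles)
```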

## 1.2. Terminal profiles generation

The terminals will be modelled by the following attributes:

- TERMINAL_ID: unique identifier of the terminal.

- (x_terminal, y_terminal): pair of real coordinates in a 100 x 100 grid that defines the geographical location of the terminal.

In [1]:
def generate_terminal_profiles_table(n_terminals, random_state=42):
    """
    Generates a table of terminal profiles.
    
    Each terminal profile consists of the following properties:
    * TERMINAL_ID: unique identifier of the terminal.
    * (x_terminal, y_terminal): pair of real coordinates in a 100 x 100 grid that defines the geographical location
    of the terminal.
    
    Parameters
    ----------
    n_terminals: int
        Number of terminals for which to generate a profile
    random_state: int
        Random state for reproducibility of the results.
    
    Returns
    -------
    terminal_profiles_table: pandas.DataFrame
        Table containing the properties of each terminal.
        
    """
    # Set random state
    np.random.seed(random_state)
    
    # Generate terminal profiles
    terminals_properties = []
    for terminal_id in range(n_terminals):
        x_terminal = np.random.uniform(0, 100)
        y_terminal = np.random.uniform(0, 100)
        terminals_properties.append([
            terminal_id,
            x_terminal, y_terminal,
        ])
    terminal_profiles_table = pd.DataFrame(terminals_properties, columns=['TERMINAL_ID',
                                                                         'x_terminal', 'y_terminal',
                                                                         ])
    return terminal_profiles_table 

## 1.3. Association of customer profiles to terminals

In [2]:
def get_list_terminals_within_radius(customer_profile, x_y_terminals, r):
    """
    Generates the list of all terminals within a radius 'r' of the customer.
    
    The simulation assumes that customers can only make transactions in terminals within a radius r of their
    geographical locations. The returned list is meant to be stored as a 'list_terminals' feature of each
    customer profile, containing the set of terminals that the customer can use.
    
    Parameters
    ----------
    customer_profile: pandas.Series
        The profile of a customer (a row from customer_profiles_table).
    x_y_terminals: numpy.ndarray
        An array that contains the geographical locations of all terminals.
    r: float
        Radius around the customer location to determine terminal availability.
    
    Returns
    -------
    list_terminals: list
        List of the IDs of the terminals available to the customer.
        
    """
    # Location of the customer
    x_y_customer = customer_profile[['x_customer', 'y_customer']].values.astype(float)
    
    # Squared difference in coordinates between customer and terminal locations
    squared_diff_x_y = np.square(x_y_customer - x_y_terminals)
    
    # Sum along rows and compute sqrt to get distance
    dist_x_y = np.sqrt(np.sum(squared_diff_x_y, axis=1))
    
    # Get the indices of terminals which are at a distance less than r
    list_terminals = list(np.where(dist_x_y < r)[0])

    return list_terminals
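
The distance computation above relies on NumPy broadcasting: subtracting the customer's `(2,)` coordinate vector from the `(n_terminals, 2)` array of terminal coordinates yields one difference row per terminal. The following self-contained sketch walks through this on toy data and shows one possible way to attach the result to every customer with `DataFrame.apply`. The helper name `terminals_within_radius`, the toy coordinates, and the radius `r=5` are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 4 terminals and 2 customers
terminals = pd.DataFrame({
    "TERMINAL_ID": [0, 1, 2, 3],
    "x_terminal": [10.0, 50.0, 52.0, 90.0],
    "y_terminal": [10.0, 50.0, 47.0, 90.0],
})
customers = pd.DataFrame({
    "CUSTOMER_ID": [0, 1],
    "x_customer": [50.0, 0.0],
    "y_customer": [50.0, 0.0],
})

# Array of shape (n_terminals, 2) holding the terminal locations
x_y_terminals = terminals[["x_terminal", "y_terminal"]].to_numpy(dtype=float)


def terminals_within_radius(customer_profile, x_y_terminals, r):
    """Same logic as the function above, restated so this example runs on its own."""
    x_y_customer = customer_profile[["x_customer", "y_customer"]].to_numpy(dtype=float)
    # Broadcasting: (2,) - (n_terminals, 2) -> (n_terminals, 2)
    dist = np.sqrt(np.sum(np.square(x_y_customer - x_y_terminals), axis=1))
    return np.where(dist < r)[0].tolist()


# Add the 'list_terminals' feature to each customer profile
customers["list_terminals"] = customers.apply(
    lambda row: terminals_within_radius(row, x_y_terminals, r=5), axis=1
)
print(customers["list_terminals"].tolist())  # [[1, 2], []]
```

The first customer sits exactly on terminal 1 and about 3.6 units from terminal 2, so both are reachable; the second customer has no terminal within 5 units. The positional indices returned by `np.where` coincide with `TERMINAL_ID` only because the IDs are assigned as 0, 1, 2, ... in the same order as the rows.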

## 1.4. Generation of transactions