# Simulation Strategy: Financial Risk Data Generation

**Module:** `src/generate_data.py`

This notebook explains the architecture of the generation script and the logic behind every function. The primary objective of this script is to generate a synthetic dataset for the Risk Analytics project, simulating a banking transaction environment that is statistically coherent and contains realistic anomalies. Due to limitations in the method, we will simulate only Account Takeover (ATO) fraud.

That said, because this project focuses on analysis and inference, I intend to get robust results from a basic Python script without over-complicating the implementation.

## Execution Pipeline Overview

The data generation process follows a linear pipeline orchestrated by the function `create_base_transactions` followed by `create_anomalies`. The flow is structured as follows:

1. **Population Initialization:** Creation of customers and assignment of profiles/locations.
2. **Statistical Pre-computation:** Calculation of mathematical parameters (Log-Normal) and weights.
3. **Fixed Expenses Generation:** Deterministic creation of recurring bills.
4. **Discretionary Expenses Generation:** Probabilistic creation of daily spending.
5. **Fraud Injection:** Introduction of specific anomaly patterns.

## Phase 1: Population Initialization and Volume Sizing

**Functions:** `assign_profiles`, `assign_locations`

Before initializing the simulation, we calculated the optimal dataset size by working backward from our Machine Learning requirements. 

To ensure stable performance for tree-based models like Random Forest and XGBoost, we require a sufficient sample of the minority class. Empirical evidence suggests that beyond roughly 500 positive fraud cases, additional samples yield diminishing returns in model improvement relative to computational cost. We therefore aim to generate 500 ATO fraud cases. According to the Sift Q3 2025 Digital Trust Index, ATO fraud represents around 1.5% of total transaction volume (1.2% in FinTech and 2.5% on global finance platforms), so the simulation needs at least 500 / 0.015 ≈ 33,333 transactions. We rounded this up to 35,000 for convenience.

To ensure realism, we benchmarked against major neobanks like Revolut, which reports approximately 1 billion monthly transactions across 65 million users. Adjusting for an estimated 50% active user rate, we derived an average frequency of ~30.8 transactions per month per active client.

Based on a target engagement of 35 transactions per month, we calculated that a customer base of approximately 1,000 clients is required to naturally generate our target volume of 35,000 transactions.
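The sizing arithmetic above can be reproduced in a few lines. This is only a back-of-the-envelope sketch using the figures quoted in the text; the variable names are illustrative, and it assumes the simulated window is a single month, which the 35,000 = 35 × 1,000 calculation implies but the text does not state outright.

```python
# Back-of-the-envelope sizing using only the figures quoted in the text.
TARGET_FRAUD_CASES = 500
ATO_FRAUD_RATE = 0.015                      # ~1.5% of transaction volume

min_transactions = TARGET_FRAUD_CASES / ATO_FRAUD_RATE    # ~33,333
target_transactions = 35_000                              # rounded up for convenience

# Benchmark: ~1 billion monthly transactions, 65 million users, ~50% active.
monthly_transactions = 1_000_000_000
active_users = 65_000_000 * 0.50
benchmark_tx_per_client = monthly_transactions / active_users   # ~30.8

TX_PER_CLIENT_PER_MONTH = 35                # target engagement chosen in the text
num_customers = target_transactions // TX_PER_CLIENT_PER_MONTH  # assumes a one-month window

print(round(min_transactions), round(benchmark_tx_per_client, 1), num_customers)
# prints: 33333 30.8 1000
```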

Before any transaction occurs, we establish the identity of the actors in the system.

### Customer Identity and Segmentation

We generate `num_customers` unique IDs and assign each one a Behavioral Profile based on demographic weights:

* **Thrifty (20%):** Low spending, high focus on essentials.
* **Standard (60%):** Balanced consumption.
* **Well-off (8%):** High luxury/transportation spending.
* **Techie (12%):** High retail/tech spending.
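As a rough illustration of the segmentation step, the sketch below draws one profile per customer using the weights listed above. It is not the actual `assign_profiles` implementation: the function name, the ID format, and the seed are assumptions, and it uses the standard-library `random` module to stay self-contained.

```python
import random

# Demographic weights from the list above.
PROFILE_WEIGHTS = {"Thrifty": 0.20, "Standard": 0.60, "Well-off": 0.08, "Techie": 0.12}

def assign_profiles_sketch(num_customers, seed=42):
    """Return {customer_id: profile}, drawing each profile by demographic weight."""
    rng = random.Random(seed)                      # fixed seed for reproducibility
    customer_ids = [f"C{i:05d}" for i in range(1, num_customers + 1)]  # hypothetical ID format
    profiles = rng.choices(
        list(PROFILE_WEIGHTS),
        weights=list(PROFILE_WEIGHTS.values()),
        k=num_customers,
    )
    return dict(zip(customer_ids, profiles))

customer_profiles = assign_profiles_sketch(1000)
```

With 1,000 customers the realized shares will only approximate the target weights, since each profile is sampled independently.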

### Selection Probability

We calculate a `selection_prob` vector. A "Well-off" customer has a higher frequency weight (2.0) than a "Thrifty" one (0.35), meaning they are statistically more likely to generate transactions in the subsequent steps.
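One way such a vector could be built is by mapping each customer's profile to a frequency weight and normalizing. In this sketch only the Well-off (2.0) and Thrifty (0.35) weights come from the text; the Standard and Techie values, like the function name, are placeholders chosen for illustration.

```python
FREQUENCY_WEIGHTS = {
    "Thrifty": 0.35,    # from the text
    "Well-off": 2.0,    # from the text
    "Standard": 1.0,    # placeholder, not from the source
    "Techie": 1.5,      # placeholder, not from the source
}

def build_selection_probs(profiles):
    """Normalize per-customer frequency weights into a probability vector."""
    raw = [FREQUENCY_WEIGHTS[p] for p in profiles]
    total = sum(raw)
    return [w / total for w in raw]

profiles = ["Thrifty", "Standard", "Well-off", "Techie"]
selection_prob = build_selection_probs(profiles)
```

After normalization the ratio between any two customers is preserved, so a Well-off customer remains 2.0 / 0.35 ≈ 5.7 times more likely to be drawn than a Thrifty one.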

### Home Location Assignment

We deterministically assign a primary residence to each `Customer_ID` from the `locations` list. This variable acts as a constraint for the generation of discretionary expenses, forcing 95% of transactions to originate from this specific city.
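The text does not say how the deterministic mapping is implemented. A minimal sketch, assuming a stable checksum of the ID is acceptable, could look like this; the city list and function name are hypothetical. Python's built-in `hash` is avoided because it is salted per process for strings.

```python
import zlib

LOCATIONS = ["Madrid", "Barcelona", "Valencia", "Seville"]   # hypothetical city list

def assign_home_location(customer_id, locations=LOCATIONS):
    """Map a customer ID to a fixed city using a stable CRC32 checksum."""
    index = zlib.crc32(customer_id.encode("utf-8")) % len(locations)
    return locations[index]

home = assign_home_location("C00001")
```

Because the result depends only on the ID and the list, re-running the generator yields the same home city for every customer.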

## Phase 2: Statistical Pre-computation

**Functions:** `precompute_lognormal_params`, `calculate_profile_category_weights`

To ensure performance and mathematical rigor, we pre-calculate the statistical distributions before entering the generation loops.

### Transaction Amount Generation: Log-Normal Distribution

Transaction amounts cannot be negative and are heavily skewed. We use the **Log-Normal Distribution**:

$$
\ln(X) \sim \mathcal{N}(\mu, \sigma^2)
$$

To bridge business requirements (Mean/Std) with this distribution, `get_lognormal_params` applies the method of moments. The log-normal moments follow from the Moment Generating Function (MGF) of the underlying normal distribution, since $E[X^k] = E[e^{k \ln X}]$:

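```python
import numpy as np

def get_lognormal_params(mean, std):
    """Convert a target mean and std into the (mu, sigma) of the underlying normal."""
    variance = std**2
    sigma_sq = np.log(1 + (variance / (mean**2)))  # sigma^2 = ln(1 + Var / Mean^2)
    sigma = np.sqrt(sigma_sq)
    mu = np.log(mean) - 0.5 * sigma_sq             # mu = ln(Mean) - sigma^2 / 2
    return mu, sigma
```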
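These formulas invert the standard log-normal identities $E[X] = e^{\mu + \sigma^2/2}$ and $\mathrm{Var}[X] = (e^{\sigma^2} - 1)\,e^{2\mu + \sigma^2}$. A quick sampling check can confirm that the round trip works. The sketch below restates the same formulas with the standard library so it runs on its own; the target mean and std are illustrative values, not taken from the project.

```python
import math
import random

def lognormal_params(mean, std):
    """Standard-library restatement of the method-of-moments formulas above."""
    sigma_sq = math.log(1 + std**2 / mean**2)
    return math.log(mean) - 0.5 * sigma_sq, math.sqrt(sigma_sq)

target_mean, target_std = 50.0, 30.0        # illustrative values, not from the source
mu, sigma = lognormal_params(target_mean, target_std)

rng = random.Random(0)
samples = [rng.lognormvariate(mu, sigma) for _ in range(100_000)]

n = len(samples)
sample_mean = sum(samples) / n
sample_std = math.sqrt(sum((x - sample_mean) ** 2 for x in samples) / n)

print(f"mean={sample_mean:.2f} std={sample_std:.2f} min={min(samples):.2f}")
```

The sample mean and standard deviation should land close to the targets, and every sampled amount is strictly positive.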

## Phase 3: Generating Fixed Expenses

**Function:** `generate_fixed_expenses`

This step simulates the deterministic nature of the banking system. For Fixed Expenses (like Housing and Utilities), the logic is Person-Centric. Unlike discretionary spending, which happens randomly in time, fixed expenses are monthly obligations attached to a specific client.

Therefore, the code structure is as follows:

### Outer Loop (The Account Holder)
We iterate through every single customer in our database (`for cust_id in customers`). This ensures no one is skipped; every client is evaluated for their monthly bills.

### Inner Loop (The Bill Checklist)
Once we select a customer, we iterate through the list of fixed categories (`for cat in fixed_categories`). For each category, we ask: "Does this specific customer pay this specific bill?" (based on the penetration probability). Finally, we generate a date based on the expense category and an amount based on the client profile. The location is the customer's assigned Home Location.
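The two loops can be sketched as follows. This is a simplified illustration rather than the real `generate_fixed_expenses`: the penetration rates, billing days, base amounts, the ±10% jitter, and the output column names other than `Customer_ID` are all placeholder assumptions.

```python
import random
from datetime import date

# All values below are illustrative placeholders, not taken from the source.
FIXED_CATEGORIES = {
    "Housing":   {"penetration": 0.90, "billing_day": 1},
    "Utilities": {"penetration": 0.95, "billing_day": 10},
}
PROFILE_BASE_AMOUNT = {"Thrifty": 500.0, "Standard": 800.0}

def generate_fixed_expenses_sketch(customers, year, month, seed=0):
    """customers: {cust_id: {"profile": str, "home": str}} -> list of transaction dicts."""
    rng = random.Random(seed)
    rows = []
    for cust_id, info in customers.items():           # outer loop: every account holder
        for cat, cfg in FIXED_CATEGORIES.items():     # inner loop: the bill checklist
            if rng.random() >= cfg["penetration"]:    # this customer does not pay this bill
                continue
            base = PROFILE_BASE_AMOUNT[info["profile"]]
            rows.append({
                "Customer_ID": cust_id,
                "Category": cat,
                "Date": date(year, month, cfg["billing_day"]),   # date driven by category
                "Amount": round(base * rng.uniform(0.9, 1.1), 2),  # amount driven by profile
                "Location": info["home"],                          # always the home location
            })
    return rows

customers = {
    "C001": {"profile": "Thrifty", "home": "Madrid"},
    "C002": {"profile": "Standard", "home": "Valencia"},
}
fixed_rows = generate_fixed_expenses_sketch(customers, 2025, 1)
```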

## Phase 4: Generating Discretionary Expenses

**Function:** `generate_discretionary_expenses`

This is the core of the simulation, handling variable daily spending (Retail, Leisure, Travel).

### Customer & Category Selection

**Who:** A customer is selected based on their activity frequency (`selection_probs`).

**What:** A category is chosen based on the profile's specific `category_weights` (e.g., a Techie prefers Retail).
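A minimal sketch of this two-stage draw is shown below. The customer IDs, probabilities, and category weights are invented for illustration; the only detail taken from the text is that a Techie leans toward Retail.

```python
import random

rng = random.Random(7)

customers = ["C001", "C002", "C003"]                        # hypothetical IDs
profiles = {"C001": "Thrifty", "C002": "Techie", "C003": "Well-off"}
selection_probs = [0.1, 0.4, 0.5]                           # illustrative, sums to 1

# Illustrative weights; only the Techie preference for Retail is stated in the text.
category_weights = {
    "Thrifty":  {"Retail": 0.5, "Leisure": 0.3, "Travel": 0.2},
    "Techie":   {"Retail": 0.7, "Leisure": 0.2, "Travel": 0.1},
    "Well-off": {"Retail": 0.3, "Leisure": 0.3, "Travel": 0.4},
}

def pick_customer_and_category():
    """Stage 1 picks who spends; stage 2 picks what they spend on."""
    cust_id = rng.choices(customers, weights=selection_probs, k=1)[0]
    weights = category_weights[profiles[cust_id]]
    category = rng.choices(list(weights), weights=list(weights.values()), k=1)[0]
    return cust_id, category

draws = [pick_customer_and_category() for _ in range(1000)]
```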

### Temporal Dynamics: Bimodal Hour Distribution and Weekly Seasonality

Human activity follows a dual-peak pattern. To simulate this without rigid rules, `generate_timestamp_from_category` calls `generate_bimodal_hour`, which uses a Gaussian Mixture Model (GMM):

- **Morning Peak:** $\mathcal{N}(\mu=10, \sigma=2.5)$

- **Evening Peak:** $\mathcal{N}(\mu=20, \sigma=2.0)$

This creates a natural flow of time where 4:00 AM transactions are mathematically rare.

Weekly seasonality is handled with multipliers based on the day of the week. For example, the probability of generating a Leisure transaction triples on Saturdays (multiplier=3.0), reflecting social habits.
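The sketch below shows one way the two temporal mechanisms could work. The peak parameters and the Saturday multiplier come from the text; the equal mixture weights, the rejection of out-of-range hours, the 1.0 multiplier on other days, and the use of multipliers as weekday sampling weights are assumptions, and the function names are hypothetical.

```python
import random

rng = random.Random(1)

PEAKS = [(10.0, 2.5), (20.0, 2.0)]    # (mean hour, std) for morning and evening, from the text
PEAK_WEIGHTS = [0.5, 0.5]             # assumed equal mixture weights

def bimodal_hour():
    """Draw an hour of day in [0, 24) from a two-component Gaussian mixture."""
    while True:
        mu, sigma = rng.choices(PEAKS, weights=PEAK_WEIGHTS, k=1)[0]
        hour = rng.gauss(mu, sigma)
        if 0.0 <= hour < 24.0:        # reject draws that fall outside the day
            return hour

# Leisure triples on Saturday per the text; other days are left at 1.0 here.
LEISURE_DAY_MULTIPLIER = [1.0, 1.0, 1.0, 1.0, 1.0, 3.0, 1.0]   # Monday=0 ... Sunday=6

def pick_leisure_weekday():
    """Sample a weekday index, treating the multipliers as relative weights."""
    return rng.choices(range(7), weights=LEISURE_DAY_MULTIPLIER, k=1)[0]

hours = [bimodal_hour() for _ in range(20_000)]
weekdays = [pick_leisure_weekday() for _ in range(9_000)]
```

Under these assumptions, hours around 04:00 sit more than two standard deviations below the morning peak, so they appear only occasionally without any hard cutoff.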

### Location Selection

Each transaction has a 95% probability of occurring in the customer's Home Location; otherwise, a random location is selected from the available list.
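As a small sketch of this rule, with a hypothetical city list and function name:

```python
import random

rng = random.Random(3)

HOME_PROBABILITY = 0.95
LOCATIONS = ["Madrid", "Barcelona", "Valencia", "Seville"]   # hypothetical city list

def pick_location(home_location):
    """Return the home city 95% of the time, otherwise a random city from the list."""
    if rng.random() < HOME_PROBABILITY:
        return home_location
    return rng.choice(LOCATIONS)      # may coincide with the home city

picks = [pick_location("Madrid") for _ in range(20_000)]
home_share = picks.count("Madrid") / len(picks)
```

Note that if the fallback draws from the full list, as assumed here, it can return the home city again, so the effective home share ends up slightly above 95%. Whether the actual script excludes the home city from the fallback is not stated in the text.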

## Phase 5: Fraud Injection (Anomalies)

**Function:** `create_anomalies`

Finally, we inject controlled noise to train the Risk Model. Unlike the previous steps, these transactions follow attack patterns, not consumer behavior.

### Velocity Attack

**Logic:** A bot executes a high volume of small-value transactions against a vulnerable merchant. The goal is to bypass security thresholds (such as 2FA/SCA checks) that typically only trigger for larger amounts.

**Pattern:** 5-15 transactions, minimal time delta (minutes), same `Terminal_ID` and `Location`.

**Timing:** Often scheduled during off-hours (02:00 - 06:00) to avoid detection.
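A velocity burst with these properties could be generated roughly as follows. This is a sketch, not the logic of `create_anomalies`: the amount range, the 1 to 3 minute gaps, the `Timestamp`, `Amount`, and `Is_Fraud` column names, and the choice to always start in the off-hours window are assumptions.

```python
import random
from datetime import datetime, timedelta

def velocity_burst(customer_id, terminal_id, location, day, seed=0):
    """Build 5-15 small transactions minutes apart on one terminal and location."""
    rng = random.Random(seed)
    n_tx = rng.randint(5, 15)
    # Start between 02:00 and 04:59 so the whole burst stays inside 02:00-06:00.
    ts = day.replace(hour=rng.randint(2, 4), minute=rng.randint(0, 59),
                     second=0, microsecond=0)
    rows = []
    for _ in range(n_tx):
        rows.append({
            "Customer_ID": customer_id,
            "Terminal_ID": terminal_id,                   # same terminal for the whole burst
            "Location": location,                         # same location for the whole burst
            "Timestamp": ts,
            "Amount": round(rng.uniform(1.0, 15.0), 2),   # assumed "small-value" range
            "Is_Fraud": 1,                                # hypothetical label column
        })
        ts += timedelta(minutes=rng.randint(1, 3))        # minimal time delta
    return rows

burst = velocity_burst("C001", "T-0042", "Madrid", datetime(2025, 1, 15))
```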

### Magnitude Fraud

**Logic:** A "Cash-out" attempt to steal maximum funds.

**Pattern:** A single transaction with an amount calculated as:

$$\text{FraudAmount} = \text{NormalAmount} \times \text{Multiplier}(10, 20)$$

**Detection Logic:** This creates a contextual outlier. A 500€ fraud on a "Thrifty" user is an anomaly ($Z\text{-score} > 5$), whereas on a "Well-off" user, it might pass as normal.
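To make the contextual-outlier idea concrete, the sketch below scales a normal amount and scores a 500€ charge against two profiles. It reads `Multiplier(10, 20)` as a uniform draw between 10 and 20, which is an assumption, and the per-profile mean and standard deviation are invented for illustration.

```python
import random

def magnitude_fraud_amount(normal_amount, rng):
    """Scale a normal amount by a multiplier drawn uniformly from [10, 20] (assumed)."""
    return round(normal_amount * rng.uniform(10, 20), 2)

def z_score(amount, profile_mean, profile_std):
    """Standardized distance of an amount from a profile's typical spending."""
    return (amount - profile_mean) / profile_std

# Illustrative (mean, std) of transaction amounts per profile, not from the source.
PROFILE_STATS = {"Thrifty": (40.0, 25.0), "Well-off": (300.0, 250.0)}

rng = random.Random(0)
fraud_amount = magnitude_fraud_amount(30.0, rng)      # somewhere between 300 and 600

z_thrifty = z_score(500.0, *PROFILE_STATS["Thrifty"])     # (500 - 40) / 25 = 18.4
z_well_off = z_score(500.0, *PROFILE_STATS["Well-off"])   # (500 - 300) / 250 = 0.8
```

With these placeholder statistics the same 500€ charge is far outside the Thrifty range yet within one standard deviation for a Well-off customer, which is why the score has to be computed per profile rather than globally.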

## Conclusion

Developing this simulation required research into the operational mechanics of financial systems. I had to investigate how real-world transactions are structured, from the seasonality of consumer spending and the statistical distributions of purchase amounts (Log-Normal) to the specific modus operandi of fraud attacks like Velocity and Account Takeover.

This attempt to grasp a basic understanding of the "why" and "how" of transactional data is essential to properly scope the subsequent phases of the project: Exploratory Data Analysis (EDA) and Model Training.