# **Notebook 01 - Synthetic Environments and Feature Construction**

## **Section 1 - Purpose of the Synthetic Environment**

### **1.1 Why a Synthetic Environment**

The empirical study of autonomous risk in intelligent systems presents a fundamental methodological challenge: real-world datasets embed **uncontrolled confounders, historical biases, regulatory artifacts, and socio-economic correlations** that obscure the structural mechanisms under investigation.

If the objective is to understand *how* autonomous risk emerges, rather than merely *where* it appears, then a controlled experimental setting becomes indispensable.

For this reason, this notebook adopts a **synthetic data paradigm**.

Synthetic environments allow us to:

* Explicitly define causal and correlational structures;
* Isolate the effects of autonomy, opacity, and feedback;
* Introduce controlled noise and uncertainty;
* Test hypothetical governance regimes without ethical or legal risk.

The goal is not realism per se, but **structural fidelity**: the environment must reproduce the *mechanisms* through which risk arises, not necessarily the surface statistics of any specific industry dataset.

### **Methodological Scope Clarification**

This synthetic environment is not designed to support population-level inference or external validity claims. Its sole purpose is structural identification: 

>isolating and stress-testing the causal mechanisms through which autonomous risk, opacity, and governance failure can emerge.

Consequently, the validity of this dataset should be evaluated in terms of causal clarity and internal coherence, not empirical representativeness.



### **1.2 Role of the Synthetic Dataset in the Project**

The dataset generated in this notebook serves as the **common substrate** for all subsequent notebooks.

Specifically, it will be used to:

* Train and evaluate supervised risk models (Notebook 02);
* Simulate anomaly and fraud detection pipelines (Notebook 03);
* Analyze opacity, interpretability, and auditability (Notebook 04);
* Study feedback loops, instability, and scheming indicators (Notebook 05);
* Support extensions toward AI Safety and AGI-aligned risk analysis (Notebook 06).

Any flaw, ambiguity, or inconsistency introduced at this stage would propagate downstream.
Accordingly, **this notebook prioritizes clarity, traceability, and reproducibility over sophistication**.



### **1.3 What This Environment Is, and Is Not**

To avoid conceptual drift, we explicitly delimit the scope of the synthetic environment.

This environment **is**:

* A controlled experimental laboratory;
* A generator of structured, interpretable signals;
* A testbed for autonomy-related risk dynamics.

This environment **is not**:

* A faithful representation of real populations;
* A benchmark for predictive performance;
* A substitute for domain-specific datasets.

No claim is made that models trained here are deployable in real-world settings.

Their value lies exclusively in **explanatory and analytical insight**.



### **1.4 Ethical and Normative Considerations**

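All variables generated in this notebook fall into one of the following categories:

* Abstract behavioral signals;
* Financial or transactional aggregates;
* Latent theoretical constructs;
* Coarse contextual attributes (age, marital status, education, region) that are observed but never used as causal drivers.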

No sensitive personal attributes (e.g., gender, race, religion, ethnicity) are included as causal drivers.

This design choice ensures that:

* Observed risk patterns cannot be attributed to protected characteristics;
* Ethical and legal concerns do not contaminate structural analysis;
* Subsequent discussions of fairness and governance remain conceptually clean.



### **1.5 Reproducibility Commitment**

All random processes in this notebook will be:

* Seeded explicitly;
* Documented step by step;
* Exported in deterministic formats.

This guarantees that:

* Independent readers can regenerate identical datasets;
* Results across notebooks remain consistent;
* Extensions and replications are feasible.
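
The commitment to explicit seeding can be illustrated with a minimal sketch. It assumes NumPy's `Generator` API (`np.random.default_rng`) rather than the global seed used later in this notebook, and the helper name `generate_sample` is hypothetical.

```python
import numpy as np

def generate_sample(seed: int, n: int = 5) -> np.ndarray:
    """Draw n lognormal values from a generator seeded with `seed`."""
    rng = np.random.default_rng(seed)  # local generator, no global state
    return rng.lognormal(mean=8.5, sigma=0.6, size=n)

first = generate_sample(seed=42)
second = generate_sample(seed=42)
other = generate_sample(seed=43)

# Same seed -> identical draws; different seed -> different draws.
same_seed_matches = np.array_equal(first, second)
different_seed_differs = not np.array_equal(first, other)
print(same_seed_matches, different_seed_differs)
```

A local generator of this kind keeps each random process independent of cell execution order, which is one way to make regeneration robust when cells are re-run.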



*With this foundation established, we now proceed to define the generative assumptions that govern the synthetic environment.*


## **Section 2 - Generative Assumptions of the Synthetic Environment**

### **2.1 Design Philosophy**

The synthetic environment is governed by a small set of **explicit generative assumptions**. These assumptions are not meant to reflect empirical truths about any specific domain, but to **instantiate structural conditions under which autonomous risk can be meaningfully studied**.

The guiding principle is **minimal sufficiency**:

> introduce only those mechanisms strictly necessary to observe autonomy, instability, opacity, and feedback-driven risk.

Every variable generated downstream must be traceable to at least one of these assumptions.



### **2.2 Population-Level Assumptions**

We assume the existence of a population of agents (or accounts, users, entities) characterized by heterogeneous but bounded attributes.

Formally:

* The population size `N` is fixed and known;
* Each entity is independent at generation time;
* Dependencies emerge only through model-mediated interactions and feedback loops (later notebooks).

This ensures that:

* Initial correlations are *designed*, not accidental;
* Emergent dependencies can be attributed to system behavior rather than data leakage.



### **2.3 Latent Behavioral Structure**

We posit the existence of latent behavioral traits that are **not directly observable**, but influence multiple observed variables.

These latent traits include:

* financial regularity vs. volatility;
* transactional intensity;
* behavioral consistency over time.

Importantly:

* These latent traits are *not* labels;
* They act as **common causes** for multiple observed features.

This design allows models to infer structure, and potentially to overfit or misinterpret it, creating the conditions for opacity and autonomous behavior.
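
The common-cause idea can be sketched as follows. This is an illustration under stated assumptions, not the generation code of Section 3: the latent trait `volatility`, the two observed signals, and the loadings 0.8 and 0.7 are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Hypothetical latent trait: higher values mean more volatile behavior.
volatility = rng.normal(0.0, 1.0, size=n)

# Two observed features that both load on the latent trait, each with
# its own independent noise term.
late_payments_signal = 0.8 * volatility + rng.normal(0.0, 0.6, size=n)
unusual_spend_signal = 0.7 * volatility + rng.normal(0.0, 0.6, size=n)

# The features correlate only because they share a common cause.
observed_corr = np.corrcoef(late_payments_signal, unusual_spend_signal)[0, 1]

# Removing the latent contribution leaves nearly uncorrelated residuals.
residual_corr = np.corrcoef(
    late_payments_signal - 0.8 * volatility,
    unusual_spend_signal - 0.7 * volatility,
)[0, 1]
print(round(observed_corr, 2), round(residual_corr, 2))
```

A model that only sees the two observed signals has to infer the shared driver indirectly, which is the situation the text describes.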



### **2.4 Noise and Imperfection Assumption**

All observed variables are assumed to be **noisy projections** of underlying processes.

Noise is introduced intentionally to:

* prevent trivial separability;
* avoid deterministic mappings;
* simulate epistemic uncertainty.

Noise terms are:

* independent across variables;
* bounded;
* stationary at generation time.

This ensures that:

* High confidence does not imply correctness;
* Model certainty can diverge from ground truth;
* Entropy and instability measures become meaningful.
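
A small sketch of these noise properties, assuming a clipped Gaussian as the bounded noise term and a hypothetical threshold rule at 0.5. Neither choice is taken from this notebook's generation code.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

signal = rng.uniform(0.0, 1.0, size=n)  # hypothetical underlying process
noise = np.clip(rng.normal(0.0, 0.15, size=n), -0.3, 0.3)  # bounded noise
observed = signal + noise

# Ground truth is defined on the clean signal, but a rule applied to the
# noisy observation disagrees with it near the decision boundary.
truth = signal > 0.5
decision = observed > 0.5
error_rate = np.mean(truth != decision)
print(noise.min() >= -0.3, noise.max() <= 0.3, round(error_rate, 3))
```

Because the errors concentrate near the boundary, a rule can look confident on most entities and still be wrong on a non-trivial share of them.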



### **2.5 Non-Stationarity Readiness**

Although the dataset generated in this notebook is static, it is designed to be **compatible with non-stationary extensions**.

Specifically:

* Feature definitions support temporal slicing;
* Aggregates can be recomputed under drift;
* Labels can be regenerated under shifting thresholds.

This prepares the environment for:

* drift detection (Notebook 03);
* feedback amplification (Notebook 05);
* governance stress tests (Notebook 06).
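
Regenerating labels under a shifting threshold might look like the sketch below. The helper `relabel_by_quantile` is hypothetical, and the Beta-distributed series is only a stand-in for a risk score.

```python
import numpy as np
import pandas as pd

def relabel_by_quantile(prob: pd.Series, q: float) -> pd.Series:
    """Return a binary label marking values strictly above the q-th quantile."""
    return (prob > prob.quantile(q)).astype(int)

rng = np.random.default_rng(2)
prob = pd.Series(rng.beta(2, 5, size=10_000))  # stand-in for a risk score

# Sweep the threshold to mimic a tightening or loosening labeling regime.
positive_rates = {
    q: relabel_by_quantile(prob, q).mean() for q in (0.80, 0.90, 0.95)
}
print(positive_rates)
```

Keeping the threshold as a parameter, rather than storing only the resulting label, is what allows the same score to be relabeled later under drift.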



### **2.6 Absence of Intentional Agency**

Crucially, **no intentional agency is encoded** at the data generation level.

Entities do not:

* optimize objectives;
* adapt strategies;
* respond to incentives.

Any appearance of strategic behavior, scheming, or autonomy in later notebooks therefore arises **entirely from the models and deployment dynamics**, not from the data itself.

This separation is essential to support the core thesis:

> autonomous risk can emerge *without* intentional agents.


**Fundamentally, no optimization process, reward signal, or strategic adaptation is present at the data generation stage.**

Any subsequent appearance of goal-directed behavior, scheming, or evasive dynamics must therefore arise from model-mediated feedback loops and deployment conditions, not from embedded intent in the dataset itself. 

This establishes a strict causal boundary between data generation and emergent autonomy.


### **2.7 Summary of Assumptions**

In summary, the synthetic environment assumes:

1. Independent entities at generation time;
2. Latent behavioral structure affecting multiple observables;
3. Noisy, imperfect measurement;
4. Structural readiness for drift and feedback;
5. No embedded intent or agency.

These assumptions define the **sandbox** within which the remainder of the project operates.



*We now translate these assumptions into concrete variables and distributions.*


## **Section 3 - Base Variables and Population Generation**

### **3.1 Overview**

This section translates the generative assumptions into **concrete variables and distributions**. The goal is not realism in the econometric sense, but **structural plausibility**: variables must interact in ways that allow autonomy, opacity, and risk amplification to emerge downstream.

All variables generated here are **exogenous** to the models that will later consume them.



### **3.2 Population Size and Reproducibility**

We fix a population size `N` and enforce reproducibility through a global random seed.

```python
import numpy as np
import pandas as pd

# Seed NumPy's global (legacy) generator so every draw below is repeatable.
np.random.seed(42)

N = 10_000  # population size
```

This choice balances:

* statistical richness;
* computational tractability;
* stability across experiments.



### **3.3 Demographic and Structural Attributes**

We begin with coarse-grained attributes that act as **contextual variables**, not decision drivers.

```python
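# idade = age; randint's upper bound is exclusive, so values span 18 to 74.
idade = np.random.randint(18, 75, size=N)

# estado_civil = marital status.
estado_civil = np.random.choice(
    ["single", "married", "divorced", "widowed"],
    size=N,
    p=[0.45, 0.38, 0.12, 0.05]
)

# escolaridade = education level.
escolaridade = np.random.choice(
    ["basic", "high_school", "college", "postgraduate"],
    size=N,
    p=[0.25, 0.40, 0.25, 0.10]
)

# regiao = region; no `p` is given, so the five regions are equally likely.
regiao = np.random.choice(
    ["north", "south", "east", "west", "central"],
    size=N
)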
```

These variables:

* introduce heterogeneity;
* support later fairness and governance analysis;
* are **not** intended to be causal drivers of risk.


### **Normative Role of Demographic Variables**

Demographic attributes included in this dataset serve exclusively as observational variables for downstream fairness and governance analysis. They are not used as causal drivers, optimization targets, or label-generating features. Their presence enables normative audits without contaminating the structural analysis of autonomous risk.



### **3.4 Economic Capacity Proxies**

We introduce continuous variables representing **capacity and constraints**.

```python
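# renda_estim = estimated income (right-skewed).
renda_estim = np.random.lognormal(mean=8.5, sigma=0.6, size=N)

# tempo_emprego = employment tenure, capped at 40.
tempo_emprego = np.clip(
    np.random.exponential(scale=6, size=N),
    0,
    40
)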
```

Notes:

* Lognormal income induces natural skewness;
* Employment time introduces stability variation;
* These variables will later correlate with behavior, not labels.



### **3.5 Financial Instrumentation**

We now generate variables that mediate **system exposure**.

```python
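# num_cartoes = number of cards (1 to 5; the upper bound is exclusive).
num_cartoes = np.random.randint(1, 6, size=N)

# limite_total = total credit limit, scaled from income by a random multiplier.
limite_total = renda_estim * np.random.uniform(0.5, 2.5, size=N)

# utilizacao_media = mean utilization; Beta draws already lie in [0, 1],
# so the clip acts only as a safeguard.
utilizacao_media = np.clip(
    np.random.beta(2, 5, size=N),
    0, 1
)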
```

Interpretation:

* Higher income allows higher limits;
* Utilization is bounded and asymmetric;
* Capacity ≠ usage (important for anomaly detection).



### **3.6 Transactional Behavior**

Behavioral intensity is drawn independently of the other variables, while irregularity is modeled as a fixed-rate share of each entity's transaction count.

```python
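# quant_transacoes = number of transactions.
quant_transacoes = np.random.poisson(lam=30, size=N)

# valor_medio_trans = mean transaction value.
valor_medio_trans = np.random.lognormal(mean=4.0, sigma=0.7, size=N)

# transacoes_incomuns = unusual transactions, a 5% share of each entity's count.
transacoes_incomuns = np.random.binomial(
    n=quant_transacoes,
    p=0.05
)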
```

These variables will later support:

* anomaly detection;
* instability signals;
* emergent scheming indicators.



### **3.7 Risk-Related Proxies (Pre-Label)**

At this stage, we introduce **risk correlates**, not labels.

```python
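# historico_atrasos = count of past late payments.
historico_atrasos = np.random.poisson(lam=1.2, size=N)

# divida_renda = debt-to-income proxy, computed here from the credit limit.
divida_renda = np.clip(
    limite_total / (renda_estim + 1e-6),
    0, 5
)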
```

These variables encode **latent stress**, not outcomes.
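
One property of `divida_renda` is worth checking explicitly. Because `limite_total` is defined as `renda_estim` times a uniform multiplier, dividing by `renda_estim` recovers that multiplier. The standalone sketch below re-creates the same construction with a local generator, using English variable names that are not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000

income = rng.lognormal(mean=8.5, sigma=0.6, size=n)
multiplier = rng.uniform(0.5, 2.5, size=n)
total_limit = income * multiplier

# Same construction as divida_renda above.
ratio = np.clip(total_limit / (income + 1e-6), 0, 5)

# The ratio recovers the multiplier (up to the tiny epsilon), so it stays
# within [0.5, 2.5] and is essentially uncorrelated with income.
max_abs_gap = np.max(np.abs(ratio - multiplier))
corr_with_income = np.corrcoef(ratio, income)[0, 1]
print(ratio.min(), ratio.max(), max_abs_gap, round(corr_with_income, 3))
```

Under these definitions the clip to `[0, 5]` never binds, and the variable behaves as a limit-to-income ratio rather than a measure of outstanding debt.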



### **3.8 Assembly of the Base DataFrame**

We consolidate all variables into a single DataFrame.

```python
df = pd.DataFrame({
    "idade": idade,
    "estado_civil": estado_civil,
    "escolaridade": escolaridade,
    "regiao": regiao,
    "renda_estim": renda_estim,
    "tempo_emprego": tempo_emprego,
    "num_cartoes": num_cartoes,
    "limite_total": limite_total,
    "utilizacao_media": utilizacao_media,
    "quant_transacoes": quant_transacoes,
    "valor_medio_trans": valor_medio_trans,
    "transacoes_incomuns": transacoes_incomuns,
    "historico_atrasos": historico_atrasos,
    "divida_renda": divida_renda
})
```

Sanity check:

```python
df.head()
```



### **3.9 Design Guarantees**

At the end of this section, we guarantee that:

* No labels exist yet;
* No intentional behavior is encoded;
* No optimization has occurred;
* All structure is latent and noisy.

This ensures that **any risk or autonomy detected later is model-induced**.


## **Section 4 - Synthetic Risk Signals and Label Construction**

### **4.1 Rationale**

This section introduces **risk-related signals** and **labels** that will later be used by supervised and unsupervised models. The central methodological constraint is the following:

> **Labels must emerge from structured interactions between variables, never be trivially encoded in a single feature.**

If labels are too obvious, downstream models merely learn shortcuts. If labels are too random, no meaningful structure can emerge. Our objective is a **controlled middle ground**.



### **4.2 Latent Risk Score (Unobserved)**

We begin by constructing a *latent* continuous risk score. This score is **never directly exposed** to models.

```python
latent_risk = (
    0.35 * df["divida_renda"] +
    0.25 * df["historico_atrasos"] +
    0.20 * df["utilizacao_media"] +
    0.10 * (df["transacoes_incomuns"] / (df["quant_transacoes"] + 1e-6)) +
    0.10 * np.random.normal(0, 1, size=N)
)
```

Key properties:

* Multi-factor composition;
* Noise injected explicitly;
* No demographic variables included;
* No hard thresholds.



### **4.3 Normalized Probability of Adverse Outcome**

We convert the latent score into a probability-like signal using a logistic transformation.

```python
prob_inadimplencia = 1 / (1 + np.exp(-latent_risk))
```

Add to the DataFrame:

```python
df["prob_inadimplencia"] = prob_inadimplencia
```

This variable:

* Is continuous;
* Represents *risk propensity*, not realization;
* Will later be used to derive multiple labels.



### **4.4 Primary Label: Credit Default (Binary)**

We define a **quantile-based label**, ensuring class balance control without leaking scale information.

```python
threshold = df["prob_inadimplencia"].quantile(0.90)

df["label_default"] = (df["prob_inadimplencia"] > threshold).astype(int)
```

Sanity check:

```python
df["label_default"].value_counts(normalize=True)
```

This ensures approximately **10% positives**, suitable for:

* credit risk modeling;
* ROC/PR evaluation;
* stress-testing autonomy effects.
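
Since the logistic transform is strictly increasing, a quantile threshold on the probability selects the same rows as the same quantile threshold on the underlying score. The sketch below checks this on a stand-in latent score drawn from a normal distribution, which is an assumption made only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
latent = pd.Series(rng.normal(1.0, 0.5, size=10_000))  # stand-in latent score
prob = 1 / (1 + np.exp(-latent))                        # logistic transform

label_from_prob = (prob > prob.quantile(0.90)).astype(int)
label_from_latent = (latent > latent.quantile(0.90)).astype(int)

positive_rate = label_from_prob.mean()
agreement = (label_from_prob == label_from_latent).mean()
print(positive_rate, agreement)
```

The label is therefore a deterministic function of the score it is derived from, which matters when deciding which columns a downstream model is allowed to see.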



### **4.5 Secondary Label: Simulated Fraud (Orthogonal)**

Fraud is modeled as a distinct signal that shares some inputs with default risk, so the two are **partially correlated but not identical** rather than strictly orthogonal.

```python
fraude_simulada = (
    (df["transacoes_incomuns"] > np.percentile(df["transacoes_incomuns"], 95)) &
    (df["utilizacao_media"] > 0.7) &
    (np.random.rand(N) < 0.6)
).astype(int)

df["fraude_simulada"] = fraude_simulada
```

Important:

* Fraud ≠ default;
* Some overlap exists, but neither subsumes the other;
* Enables multi-task risk analysis later.
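
To get a feel for how often the three fraud conditions coincide, the following standalone sketch re-simulates the same recipe with a local generator and a larger sample. The sample size of 200,000 and the English variable names are assumptions made for this illustration only.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000  # larger than N, only to stabilize the estimate

n_transactions = rng.poisson(lam=30, size=n)
unusual = rng.binomial(n=n_transactions, p=0.05)
utilization = rng.beta(2, 5, size=n)

# The three conditions combined by the fraud rule.
cond_unusual = unusual > np.percentile(unusual, 95)
cond_utilization = utilization > 0.7
cond_random = rng.random(n) < 0.6

fraud = cond_unusual & cond_utilization & cond_random
print(cond_unusual.mean(), cond_utilization.mean(), fraud.mean())
```

In this sketch the joint condition is met only rarely, which suggests that the number of fraud positives at `N = 10_000` may be very small and is worth inspecting with `df["fraude_simulada"].sum()` before any modeling.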



### **4.6 Behavioral Anomaly Flag (Unsupervised Proxy)**

We introduce a **non-label anomaly signal**, to be used in unsupervised settings.

```python
anomalia_padrao = (
    (df["quant_transacoes"] > df["quant_transacoes"].quantile(0.95)) |
    (df["valor_medio_trans"] > df["valor_medio_trans"].quantile(0.95))
).astype(int)

df["anomalia_padrao"] = anomalia_padrao
```

This variable:

* Is intentionally crude;
* Represents *surface-level abnormality*;
* Will later be contrasted with learned anomalies.



### **4.7 Behavioral Suspicion Index (Soft Signal)**

We add a continuous behavioral index to capture borderline patterns.

```python
comportamento_suspeito = (
    0.5 * df["utilizacao_media"] +
    0.3 * (df["transacoes_incomuns"] / (df["quant_transacoes"] + 1e-6)) +
    0.2 * np.random.rand(N)
)

df["comportamento_suspeito"] = comportamento_suspeito
```

This variable:

* Is continuous;
* Is not a label;
* Supports later interpretability and risk layering.



### **4.8 Final Consistency Checks**

```python
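# Both classes must be present. fraude_simulada needs three fairly rare
# conditions to coincide, so positives are scarce and this check may be
# sensitive to the seed.
assert df["label_default"].nunique() == 2, "label_default has a single class"
assert df["fraude_simulada"].nunique() == 2, "fraude_simulada has a single class"
assert df.isnull().sum().sum() == 0, "unexpected missing values"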
```



### **4.9 Conceptual Guarantees**

At the end of this section:

* Labels are **derived**, not injected;
* Multiple risk notions coexist;
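* No single raw feature trivially predicts any label, provided `prob_inadimplencia` (the score that `label_default` thresholds) is withheld from model inputs;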
* Future autonomy and opacity emerge **from modeling**, not data leakage.

This is a *clean causal boundary* between data generation and model behavior.
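
The claim about single-feature predictability can be checked empirically. The sketch below computes a per-feature AUC using the rank-based (Mann-Whitney) form, so it needs only pandas. The helper `single_feature_auc` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def single_feature_auc(feature: pd.Series, label: pd.Series) -> float:
    """AUC of one feature used as a score for a binary label."""
    ranks = feature.rank(method="average")
    n_pos = int(label.sum())
    n_neg = len(label) - n_pos
    rank_sum_pos = ranks[label == 1].sum()
    return float((rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg))

rng = np.random.default_rng(6)
score = pd.Series(rng.normal(size=2_000))
label = (score > score.quantile(0.90)).astype(int)  # label derived from score
unrelated = pd.Series(rng.normal(size=2_000))       # independent feature

auc_leaky = single_feature_auc(score, label)
auc_unrelated = single_feature_auc(unrelated, label)
print(round(auc_leaky, 3), round(auc_unrelated, 3))
```

Applied to this notebook's DataFrame, a call such as `single_feature_auc(df[col], df["label_default"])` over the numeric columns would flag any column whose AUC sits close to 0 or 1. A column that the label is thresholded from is expected to reach the extreme, as `score` does here.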

### **Interpretation of Opacity (O)**

Opacity is operationalized here as a proxy for epistemic distance between system internals and external oversight, not as a direct measure of risk, complexity, or performance. Its construction intentionally combines autonomy and imperfect external signals to reflect auditability loss rather than outcome severity.


## **Section 5 - Dataset Finalization, Versioning and Reproducibility**

### **5.1 Objective of This Section**

This section has three clear goals:

1. **Freeze** the dataset in a reproducible state;
2. **Document** its structure and guarantees;
3. **Persist** the dataset in formats suitable for all downstream notebooks.

From this point onward:

> **No notebook is allowed to mutate the raw dataset.**
> All transformations must be explicit, derived, and documented.

This is essential for scientific validity and later publication.
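
One way to make the no-mutation rule checkable is to fingerprint the raw frame and compare the fingerprint before and after downstream work. The sketch below is a minimal illustration: the helper `frame_fingerprint` is hypothetical, and the toy frame stands in for the real dataset.

```python
import hashlib
import pandas as pd

def frame_fingerprint(frame: pd.DataFrame) -> str:
    """SHA-256 digest over per-row hashes of a DataFrame's values and index."""
    row_hashes = pd.util.hash_pandas_object(frame, index=True)
    return hashlib.sha256(row_hashes.to_numpy().tobytes()).hexdigest()

raw = pd.DataFrame({"a": [1, 2, 3], "b": [0.1, 0.2, 0.3]})
reference = frame_fingerprint(raw)

derived = raw.copy()  # work on a copy, never on the raw frame
derived["a_times_b"] = derived["a"] * derived["b"]

raw_unchanged = frame_fingerprint(raw) == reference
derived_differs = frame_fingerprint(derived) != reference
print(raw_unchanged, derived_differs)
```

Recording the reference digest next to the saved file would let later notebooks assert that the frame they loaded is the one produced here.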



### **5.2 Final Dataset Overview**

Before saving, we inspect the final schema.

```python
df.columns.tolist()
```

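Expected groups of variables (`tipo_emprego`, `score_credito`, `compra_internacional`, and `assinaturas` appear in this schema but are not created by the code in Sections 3 and 4):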

#### **Profile & Socioeconomic (non-sensitive)**

* idade
* estado_civil
* escolaridade
* renda_estim
* regiao
* tipo_emprego
* tempo_emprego

#### **Financial & Transactional**

* num_cartoes
* limite_total
* utilizacao_media
* historico_atrasos
* score_credito
* divida_renda
* quant_transacoes
* valor_medio_trans
* compra_internacional
* assinaturas
* transacoes_incomuns

#### **Risk & Behavioral Signals**

* prob_inadimplencia
* label_default
* fraude_simulada
* anomalia_padrao
* comportamento_suspeito

> **No protected attributes** (e.g., sex, race, religion) are present.
> This is a deliberate design choice aligned with governance constraints explored later.



### **5.3 Dataset Integrity Checks**

We perform structural and statistical sanity checks.

```python
print("Shape:", df.shape)
print("Missing values:", df.isnull().sum().sum())
print("\nLabel distribution:")
print(df["label_default"].value_counts(normalize=True))
```

Optional: correlation snapshot (for human inspection only):

```python
df[
    ["prob_inadimplencia", "label_default", "fraude_simulada", "anomalia_padrao"]
].corr()
```

These checks are **diagnostic**, not used by models.



### **5.4 Dataset Versioning Strategy**

We explicitly version the dataset to prevent silent drift.

**Version naming convention:**

```
DatasetFinanceiro_v3.2
```

For this notebook:

* Sample size: `10000`
* Major version: `1`
* Minor version: `0`

**Version:** `DatasetFinanceiro_v3.2.csv`



### **5.5 Persisting the Dataset**

We save the dataset in two complementary formats.

#### CSV (human-readable, GitHub-friendly)

```python
df.to_csv(
    "DatasetFinanceiro_v3.2.csv",
    index=False
)
```

#### Parquet (efficient, analytics-friendly)

```python
# Requires a Parquet engine such as pyarrow or fastparquet.
df.to_parquet(
    "DatasetFinanceiro_v3.2.parquet",
    index=False
)
```

Confirmation:

```python
print("DatasetFinanceiro_v3.2 saved successfully.")
df.head()
```



### **5.6 Dataset Evolution Across the Project**

This notebook introduces the base synthetic environment and the core variables used throughout the project. However, it does not represent the final state of the dataset.

As the theoretical framework is progressively operationalized, later notebooks introduce additional features, including uncertainty proxies, drift measures, anomaly signals, and higher-order interaction terms.

The final dataset used for all consolidated analyses is:

> **DatasetFinanceiro_v3.2**

Earlier versions generated in this notebook and intermediate stages are maintained for reproducibility but should be interpreted as developmental snapshots rather than final empirical objects.



### **5.7 Reproducibility Guarantees**

This notebook guarantees:

* Deterministic generation (via fixed random seeds);
* Explicit feature construction;
* Clear separation between:

  * raw signals;
  * derived risk variables;
  * labels.

Any future experiment can:

* reload this dataset;
* recompute derived features;
* reproduce all reported results.
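
Reloading can be verified with a round-trip check. The sketch below writes a toy frame to an in-memory CSV buffer and reads it back. The toy columns are stand-ins, and `float_precision="round_trip"` is an optional `read_csv` setting chosen here to request exact float parsing.

```python
import io
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
original = pd.DataFrame({
    "renda_estim": rng.lognormal(mean=8.5, sigma=0.6, size=100),
    "num_cartoes": rng.integers(1, 6, size=100),
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)

# Ask the C parser for exact float parsing when reading back.
reloaded = pd.read_csv(buffer, float_precision="round_trip")

# Raises AssertionError if the reloaded frame differs from the original.
pd.testing.assert_frame_equal(original, reloaded)
print("round trip OK")
```

The same comparison can be run against the saved files, substituting the file path for the buffer.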



### **5.8 Formal Closure of Notebook 01**

> **Notebook 01 establishes the empirical substrate of the project.**

From this point onward:

* Risk is no longer *simulated arbitrarily*;
* It is *learned, amplified, distorted, or controlled* by models.

This allows us to study **autonomous risk as an emergent property**, not a baked-in artifact.


## **Epistemic Role of Notebook 01**

This notebook functions as a controlled experimental substrate rather than a predictive benchmark. It establishes the minimal conditions under which autonomous risk can be meaningfully studied, ensuring that all downstream phenomena reflect model behavior and system dynamics rather than artifacts of data leakage or embedded intent.