# generate_obs_hte_26()

### Mathematical Specification of `generate_obs_hte_26()`

The `generate_obs_hte_26()` function generates an observational dataset in which both the outcome and the treatment assignment depend nonlinearly on the confounders, and the treatment effect is heterogeneous. The data-generating process (DGP) is defined as follows:

#### 1. Confounders ($X$)
The dataset contains five confounders $X = (X_1, X_2, X_3, X_4, X_5)^T$:
- $X_1$: `tenure_months`
- $X_2$: `avg_sessions_week`
- $X_3$: `spend_last_month`
- $X_4$: `premium_user`
- $X_5$: `urban_resident`

**Base Features Sampling:**
The base features $(X_1, X_2, X_3, X_5)$ are sampled using a **Gaussian Copula** to introduce correlations while preserving specific marginal distributions. With rows and columns ordered as $(X_1, X_2, X_3, X_5)$, the correlation matrix for the underlying Gaussian variables is:
$$\Sigma = \begin{pmatrix} 1.0 & 0.3 & 0.2 & 0.0 \\ 0.3 & 1.0 & 0.4 & 0.0 \\ 0.2 & 0.4 & 1.0 & 0.0 \\ 0.0 & 0.0 & 0.0 & 1.0 \end{pmatrix}$$

The marginal distributions are:
- $X_1 \sim \text{Lognormal}(\mu = \ln 24, \sigma = 0.6)$, clipped to $[0, 120]$.
- $X_2 \sim \text{NegativeBinomial}(\text{mean} = 5, \text{dispersion} = 0.5)$, clipped to $[0, 40]$.
- $X_3 \sim \text{Lognormal}(\mu = \ln 60, \sigma = 0.9)$, clipped to $[0, 500]$.
- $X_5 \sim \text{Bernoulli}(p = 0.60)$.
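
The copula step can be sketched with NumPy alone. This is a minimal illustration, not the library's implementation: the sample size, the seed, the helper `nb_quantile`, and the reading of "dispersion" as the negative binomial size parameter are all assumptions made for the example.

```python
import math
import numpy as np

def nb_quantile(u, mean, size, k_max=2000):
    """Inverse CDF of a negative binomial given its mean and size.

    Assumption: 'dispersion' in the text is read as the size parameter r,
    so the success probability is p = r / (r + mean).
    """
    p = size / (size + mean)
    k = np.arange(k_max + 1)
    log_pmf = np.array([
        math.lgamma(i + size) - math.lgamma(i + 1) - math.lgamma(size)
        for i in k
    ]) + size * math.log(p) + k * math.log1p(-p)
    cdf = np.cumsum(np.exp(log_pmf))
    # Smallest k whose CDF reaches u
    return np.searchsorted(cdf, u, side="left")

rng = np.random.default_rng(0)  # hypothetical seed
n = 5000                        # hypothetical sample size

# Correlation matrix, ordered as (X1, X2, X3, X5)
corr = np.array([
    [1.0, 0.3, 0.2, 0.0],
    [0.3, 1.0, 0.4, 0.0],
    [0.2, 0.4, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])

# Step 1: correlated standard normals
z = rng.multivariate_normal(np.zeros(4), corr, size=n)
# Step 2: map to uniforms with the standard normal CDF
u = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))

# Step 3: push through each marginal's inverse CDF, then clip.
# For a lognormal, the quantile of Phi(z) is simply exp(mu + sigma * z).
x1 = np.clip(np.exp(math.log(24) + 0.6 * z[:, 0]), 0, 120)
x2 = np.clip(nb_quantile(u[:, 1], mean=5.0, size=0.5), 0, 40)
x3 = np.clip(np.exp(math.log(60) + 0.9 * z[:, 2]), 0, 500)
x5 = (u[:, 3] < 0.60).astype(int)
```

Because each marginal transform is monotone, the rank dependence encoded in $\Sigma$ carries over to the transformed features, while every feature keeps its stated marginal distribution (up to clipping).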

**Derived Feature ($X_4$):**
The feature $X_4$ (`premium_user`) is drawn from a logistic model in $X_1$, $X_2$, and $X_3$ ($X_5$ does not enter):
$$\text{logit}(P(X_4 = 1 | X_1, X_2, X_3)) = -5.0 + 0.7 \ln(1 + X_2) + 0.5 \ln(1 + X_3) + 0.01 X_1$$
$$X_4 \sim \text{Bernoulli}(P(X_4 = 1 | X_1, X_2, X_3))$$
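
As a quick check of the logistic model for `premium_user`, the sketch below evaluates the probability at one illustrative point. The function name `premium_probability`, the evaluation point, and the seed are assumptions made for the example.

```python
import numpy as np

def premium_probability(x1, x2, x3):
    """P(X4 = 1) from the logistic model in the specification."""
    logit = -5.0 + 0.7 * np.log1p(x2) + 0.5 * np.log1p(x3) + 0.01 * x1
    return 1.0 / (1.0 + np.exp(-logit))

# Illustrative user: 24 months tenure, 5 sessions/week, spend of 60
p = premium_probability(np.array([24.0]), np.array([5.0]), np.array([60.0]))
print(p)  # roughly 0.19

# Draw the binary feature from that probability
rng = np.random.default_rng(0)  # hypothetical seed
x4 = rng.binomial(1, p)
```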


#### 2. Treatment Assignment ($D$)
The treatment $D \in \{0, 1\}$ is drawn as $D \sim \text{Bernoulli}(e(X))$, where the propensity score $e(X)$ is:
$$e(X) = P(D=1|X) = \sigma\left( \alpha_d + \text{bound}\left( f_d(X), 2.0 \right) \right)$$
where:
- $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function.
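- $\text{bound}(z, c) = c \cdot \tanh(z/c)$ squashes the score into $(-c, c)$. With $c = 2.0$ the logit of the propensity score stays within $(\alpha_d - 2, \alpha_d + 2)$, which keeps $e(X)$ away from 0 and 1 and so ensures positivity.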
- $\alpha_d$ is a calibration constant chosen such that the overall treatment rate is approximately **35%**.
- The score $f_d(X)$ is defined as:
$$f_d(X) = 0.005 X_1 + 0.8 X_4 + 0.25 X_5 + 0.8 \tanh\left(\ln\frac{1+X_3}{61}\right) + 0.2 \ln\frac{1+X_2}{6} \tanh\left(\frac{X_1}{24} - 1\right) + 0.4 X_4 (X_5 - 0.5)$$
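
The assignment mechanism can be sketched as follows. The specification does not say how $\alpha_d$ is calibrated, so the bisection in `calibrate_alpha` is an assumption, as are the helper names, the seed, and the independent stand-in covariates (used only to keep the example self-contained; they ignore the copula correlations).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bound(z, c=2.0):
    """Smoothly squash z into (-c, c)."""
    return c * np.tanh(z / c)

def f_d(x1, x2, x3, x4, x5):
    """Treatment score from the specification."""
    return (0.005 * x1 + 0.8 * x4 + 0.25 * x5
            + 0.8 * np.tanh(np.log((1 + x3) / 61))
            + 0.2 * np.log((1 + x2) / 6) * np.tanh(x1 / 24 - 1)
            + 0.4 * x4 * (x5 - 0.5))

def calibrate_alpha(score, target=0.35, lo=-10.0, hi=10.0, iters=60):
    """Bisection on alpha (assumed method): mean propensity is increasing in alpha."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sigmoid(mid + score).mean() < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Stand-in covariates (hypothetical, drawn independently)
rng = np.random.default_rng(0)
n = 5000
x1 = np.clip(rng.lognormal(np.log(24), 0.6, n), 0, 120)
x2 = np.clip(rng.negative_binomial(0.5, 0.5 / 5.5, n), 0, 40)
x3 = np.clip(rng.lognormal(np.log(60), 0.9, n), 0, 500)
x4 = rng.binomial(1, 0.2, n)
x5 = rng.binomial(1, 0.6, n)

score = bound(f_d(x1, x2, x3, x4, x5), 2.0)
alpha_d = calibrate_alpha(score, target=0.35)
e = sigmoid(alpha_d + score)   # propensity scores
d = rng.binomial(1, e)         # treatment draws
```

Since the bounded score lies in $(-2, 2)$, every propensity in this sketch falls between $\sigma(\alpha_d - 2)$ and $\sigma(\alpha_d + 2)$.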


#### 3. Heterogeneous Treatment Effect ($\tau(X)$)
The conditional average treatment effect (CATE) is a nonlinear function of the confounders, clipped to the interval $[0.1, 3.0]$:
$$\tau(X) = \text{clip}\left( 1.2 + 0.6 \tanh\left(\ln\frac{1+X_2}{6}\right) + 0.4 X_4 - 0.5 \tanh\left(\frac{X_1}{48}\right) + 0.2 X_5 \tanh\left(\ln\frac{1+X_3}{61}\right), 0.1, 3.0 \right)$$
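
The CATE formula translates directly into NumPy. The function name `tau` and the evaluation point are chosen for illustration only.

```python
import numpy as np

def tau(x1, x2, x3, x4, x5):
    """CATE from the specification, clipped to [0.1, 3.0]."""
    raw = (1.2
           + 0.6 * np.tanh(np.log((1 + x2) / 6))
           + 0.4 * x4
           - 0.5 * np.tanh(x1 / 48)
           + 0.2 * x5 * np.tanh(np.log((1 + x3) / 61)))
    return np.clip(raw, 0.1, 3.0)

# Illustrative user: x2 = 5 and x3 = 60 zero out both log terms,
# leaving 1.2 - 0.5 * tanh(0.5)
effect = tau(np.array([24.0]), np.array([5.0]), np.array([60.0]),
             np.array([0]), np.array([0]))
print(effect)  # about 0.969
```

The centering constants 6 and 61 make the log terms vanish at `avg_sessions_week = 5` and `spend_last_month = 60`, so the effect at that point depends only on tenure and the two binary features.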


#### 4. Outcome Model ($Y$)
The outcome is a continuous variable generated as:
$$Y = f_y(X) + D \cdot \tau(X) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, 3.5^2)$$
The baseline outcome function $f_y(X)$ includes linear and nonlinear terms:
$$f_y(X) = 0.01 X_1 + 1.2 X_4 + 0.6 X_5 + g_y(X)$$
where:
$$g_y(X) = 1.5 \tanh\left(\frac{X_1}{24}\right) + 0.5 \left(\ln\frac{1+X_2}{6}\right)^2 + 0.2 \ln\frac{1+X_3}{61} \ln\frac{1+X_2}{6} + 0.5 X_4 \ln\frac{1+X_2}{6} + 0.8 X_5 \tanh\left(\frac{1}{2} \ln\frac{1+X_3}{61}\right)$$
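
A minimal sketch of the outcome equation follows. The helper names `g_y`, `f_y`, and `simulate_outcome` are hypothetical; the treatment indicator and the CATE values are passed in as arrays so the example stays self-contained.

```python
import numpy as np

def g_y(x1, x2, x3, x4, x5):
    """Nonlinear part of the baseline outcome."""
    s = np.log((1 + x2) / 6)    # centered log sessions
    m = np.log((1 + x3) / 61)   # centered log spend
    return (1.5 * np.tanh(x1 / 24) + 0.5 * s**2 + 0.2 * m * s
            + 0.5 * x4 * s + 0.8 * x5 * np.tanh(0.5 * m))

def f_y(x1, x2, x3, x4, x5):
    """Baseline outcome: linear terms plus g_y."""
    return 0.01 * x1 + 1.2 * x4 + 0.6 * x5 + g_y(x1, x2, x3, x4, x5)

def simulate_outcome(x1, x2, x3, x4, x5, d, tau_x, rng, sigma=3.5):
    """Y = f_y(X) + D * tau(X) + eps, with eps ~ N(0, sigma^2)."""
    eps = rng.normal(0.0, sigma, size=np.shape(x1))
    return f_y(x1, x2, x3, x4, x5) + d * tau_x + eps

# Baseline at an illustrative point: 0.01 * 24 + 1.5 * tanh(1)
x = (np.array([24.0]), np.array([5.0]), np.array([60.0]),
     np.array([0]), np.array([0]))
baseline = f_y(*x)
print(baseline)  # about 1.382

rng = np.random.default_rng(0)  # hypothetical seed
y = simulate_outcome(*x, d=np.array([1]), tau_x=np.array([0.97]), rng=rng)
```

Note that the noise standard deviation of 3.5 is large relative to the CATE range of $[0.1, 3.0]$, so individual outcomes are dominated by noise.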


In [None]:
from causalis.scenarios.unconfoundedness.dgp import generate_obs_hte_26

data = generate_obs_hte_26(return_causal_data=False, include_oracle=True)
data.head()

In [None]:
print(f"Ground truth ATE is {data['cate'].mean()}")
print(f"Ground truth ATTE is {data[data['d'] == 1]['cate'].mean()}")

In [None]:
from causalis.data_contracts import CausalData

causaldata = CausalData(df=data,
                        treatment='d',
                        outcome='y',
                        # all five confounders X1..X5 from the specification above
                        confounders=['tenure_months', 'avg_sessions_week', 'spend_last_month',
                                     'premium_user', 'urban_resident'])
causaldata

In [None]:
from causalis.shared import outcome_stats
outcome_stats(causaldata)

In [None]:
from causalis.shared import outcome_plots
outcome_plots(causaldata)