Merged
11 changes: 11 additions & 0 deletions .github/pull_request_template.md
@@ -0,0 +1,11 @@
## Checklist

- [ ] I've formatted the new code by running `hatch run dev:format` before committing.
- [ ] I've added tests for new code.
- [ ] I've added docstrings for the new code.

## Description

Please describe your changes here. If this fixes a bug, please link to the issue if possible.

Issue Number: N/A
12 changes: 12 additions & 0 deletions .github/workflows/ruff.yml
@@ -0,0 +1,12 @@
name: Check linting
on:
pull_request:
push:
branches:
- main
jobs:
ruff:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3.5.2
- uses: chartboost/ruff-action@v1
34 changes: 34 additions & 0 deletions .github/workflows/tests.yml
@@ -0,0 +1,34 @@
name: Run Tests
on:
pull_request:
push:
branches:
- main

jobs:
unit-tests:
name: Run Tests
runs-on: ubuntu-latest
strategy:
matrix:
# Select the operating systems and Python versions to test against
os: ["ubuntu-latest", "macos-latest"]
python-version: ["3.10", "3.11"]
fail-fast: true
steps:
- name: Check out the code
uses: actions/checkout@v3.5.2
with:
fetch-depth: 1
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}

# Install Hatch
- name: Install Hatch
uses: pypa/hatch@install

# Run the unit tests and build the coverage report
- name: Run Tests
run: hatch run dev:test
65 changes: 56 additions & 9 deletions README.md
@@ -1,17 +1,64 @@
## My Project
# SyntheticCausalDataGen

TODO: Fill this README out!
This package lets you define your own causal data-generating process and then simulate data from it. The package also provides complex components, such as periodic and temporal trends, that you can add to your process, and all of these operations are fully composable with one another.

Be sure to:
A short example is given below:
```python
from causal_validation import Config, simulate
from causal_validation.effects import StaticEffect
from causal_validation.plotters import plot
from causal_validation.transforms import Trend, Periodic
from causal_validation.transforms.parameter import UnitVaryingParameter
from scipy.stats import norm

* Change the title in this README
* Edit your repository description on GitHub
cfg = Config(
n_control_units=10,
n_pre_intervention_timepoints=60,
n_post_intervention_timepoints=30,
)

## Security
# Simulate the base observation
base_data = simulate(cfg)

See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.
# Apply a linear trend with unit-varying intercept
intercept = UnitVaryingParameter(sampling_dist = norm(0, 1))
trend_component = Trend(degree=1, coefficient=0.1, intercept=intercept)
trended_data = trend_component(base_data)

## License
# Simulate a 5% lift in the treated unit's post-intervention data
effect = StaticEffect(0.05)
inflated_data = effect(trended_data)

This project is licensed under the Apache-2.0 License.
# Plot your data
plot(inflated_data)
```
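For intuition, `StaticEffect(0.05)` conceptually amounts to a multiplicative lift on the treated unit's post-intervention observations. The sketch below is a minimal, package-independent illustration; the function name `apply_static_effect` is hypothetical and not part of the package's API:

```python
def apply_static_effect(series, effect, intervention_idx):
    """Multiply observations from intervention_idx onwards by (1 + effect)."""
    return [
        y * (1 + effect) if t >= intervention_idx else y
        for t, y in enumerate(series)
    ]

treated = [100.0, 100.0, 100.0, 100.0, 100.0]
lifted = apply_static_effect(treated, 0.05, intervention_idx=3)
print(lifted)  # the final two observations are inflated by 5%
```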


## Examples

To supplement the above example, we have two more detailed notebooks that exhaustively present and explain the functionality in this package, along with how the generated data may be integrated with [AZCausal](https://github.com/amazon-science/azcausal).
1. [Basic notebook](): shows the full range of available functions for data generation.
2. [AZCausal notebook](): shows how the generated data may be used within an AZCausal model.

## Installation

In this section we guide the user through the installation of this package. We distinguish here between _users_ of the package who seek to define their own data generating processes, and _developers_ who wish to extend the existing functionality of the package.

### Prerequisites

- Python 3.10 or higher
- [Hatch](https://hatch.pypa.io/) (optional for users, required for developers)

### For Users

1. It's strongly recommended to use a virtual environment. Create and activate one using your preferred method before proceeding with the installation.
2. Clone the package `git clone git@github.com:amazon-science/causal-validation.git`
3. Enter the package's root directory `cd causal-validation`
4. Install the package `pip install -e .`
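After step 4, a quick sanity check confirms the package is importable (the import name `causal_validation` is taken from the example above):

```python
import importlib.util

# find_spec returns None when the package is not visible in the active environment.
spec = importlib.util.find_spec("causal_validation")
print("installed" if spec is not None else "not importable; check your virtual environment")
```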

### For Developers

1. Follow steps 1-3 from `For Users`
2. Create a hatch environment `hatch env create`
3. Open a hatch shell `hatch shell`
4. Validate your installation by running `hatch run dev:test`
113 changes: 113 additions & 0 deletions examples/azcausal.pct.py
@@ -0,0 +1,113 @@
# %%
from azcausal.estimators.panel.sdid import SDID
import scipy.stats as st

from causal_validation import (
Config,
simulate,
)
from causal_validation.effects import StaticEffect
from causal_validation.plotters import plot
from causal_validation.transforms import (
Periodic,
Trend,
)
from causal_validation.transforms.parameter import UnitVaryingParameter

# %% [markdown]
# ## AZCausal Integration
#
# Amazon's [AZCausal](https://github.com/amazon-science/azcausal) library provides the
# functionality to fit synthetic control and difference-in-difference models to your
# data. Integrating the synthetic data generating process of `causal_validation` with
# AZCausal is trivial, as we show in this notebook. To start, we'll simulate a toy
# dataset.

# %%
cfg = Config(
n_control_units=10,
n_pre_intervention_timepoints=60,
n_post_intervention_timepoints=30,
seed=123,
)

linear_trend = Trend(degree=1, coefficient=0.05)
data = linear_trend(simulate(cfg))
plot(data)

# %% [markdown]
# We'll now simulate a 5% lift in the treatment group's observations. This
# will inflate the treated group's observations in the post-intervention window.

# %%
TRUE_EFFECT = 0.05
effect = StaticEffect(effect=TRUE_EFFECT)
inflated_data = effect(data)
plot(inflated_data)

# %% [markdown]
# ### Fitting a model
#
# We now have some very toy data on which we may apply a model. For this demonstration
# we shall use the Synthetic Difference-in-Differences model implemented in AZCausal;
# however, the approach shown here will work for any model implemented in AZCausal. To
# achieve this, we must first coerce the data into a format that is digestible for
# AZCausal. Through the `.to_azcausal()` method implemented here, this is
# straightforward to achieve. Once we have an AZCausal-compatible dataset, the modelling
# is very simple by virtue of the clean design of AZCausal.

# %%
panel = inflated_data.to_azcausal()
model = SDID()
result = model.fit(panel)
print(f"Delta: {TRUE_EFFECT - result.effect.percentage().value / 100}")
print(result.summary(title="Synthetic Data Experiment"))

# %% [markdown]
# We see that SDID has done an excellent job of estimating the treatment
# effect. However, given the simplicity of the data, this is not surprising. With the
# functionality within this package, though, we can easily construct more complex datasets
# in an effort to fully stress-test any new model and identify its limitations.
#
# To achieve this, we'll simulate 10 control units, 60 pre-intervention time points, and
# 30 post-intervention time points according to the following process:
# $$
# \begin{align}
# \mu_{n, t} & \sim \mathcal{N}(20, 0.5^2)\\
# \alpha_{n} & \sim \mathcal{N}(0, 1^2)\\
# \beta_{n} & \sim \mathcal{N}(0.05, 0.01^2)\\
# \nu_n & \sim \mathcal{N}(1, 1^2)\\
# \gamma_n & \sim \operatorname{Student-t}_{10}(1, 1^2)\\
# \mathbf{Y}_{n, t} & = \mu_{n, t} + \alpha_{n} + \beta_{n}t + \nu_n\sin\left(3\times 2\pi t + \gamma_n\right) + \delta_{t, n}
# \end{align}
# $$
# where the true treatment effect $\delta_{t, n}$ is 5% when $n=1$ and $t\geq 60$ and 0
# otherwise. Meanwhile, $\mathbf{Y}$ is the matrix of observations, long in the number of
# time points and wide in the number of units.
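For intuition, the process above can be sketched in plain Python using only the standard library. This is a package-independent illustration (in practice, prefer the package's `simulate` and transforms, as shown in the next cell); the phase shifts here use a normal stand-in for the Student-t draw:

```python
import math
import random

random.seed(123)

N_UNITS, T_PRE, T_POST = 10, 60, 30
T = T_PRE + T_POST
TRUE_EFFECT = 0.05

# One draw per unit, mirroring the role of UnitVaryingParameter
alpha = [random.gauss(0, 1) for _ in range(N_UNITS)]       # intercepts
beta = [random.gauss(0.05, 0.01) for _ in range(N_UNITS)]  # slopes
nu = [random.gauss(1, 1) for _ in range(N_UNITS)]          # seasonal amplitudes
gamma = [random.gauss(0, 1) for _ in range(N_UNITS)]       # phase shifts (normal stand-in)

Y = []
for n in range(N_UNITS):
    row = []
    for t in range(T):
        mu = random.gauss(20, 0.5)
        y = mu + alpha[n] + beta[n] * t + nu[n] * math.sin(3 * 2 * math.pi * t / T + gamma[n])
        if n == 0 and t >= T_PRE:  # treated unit in the post-intervention window
            y *= 1 + TRUE_EFFECT
        row.append(y)
    Y.append(row)

print(len(Y), len(Y[0]))  # 10 units by 90 time points
```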

# %%
cfg = Config(
n_control_units=10,
n_pre_intervention_timepoints=60,
n_post_intervention_timepoints=30,
global_mean=20,
global_scale=1,
seed=123,
)

intercept = UnitVaryingParameter(sampling_dist=st.norm(loc=0.0, scale=1))
coefficient = UnitVaryingParameter(sampling_dist=st.norm(loc=0.05, scale=0.01))
linear_trend = Trend(degree=1, coefficient=coefficient, intercept=intercept)

amplitude = UnitVaryingParameter(sampling_dist=st.norm(loc=1.0, scale=2))
shift = UnitVaryingParameter(sampling_dist=st.t(df=10))
periodic = Periodic(amplitude=amplitude, shift=shift, frequency=3)

data = effect(periodic(linear_trend(simulate(cfg))))
plot(data)

# %% [markdown]
# As before, we may now go about estimating the treatment effect. However, this
# time we see that the delta between the estimated and true effect is much larger than
# before.

# %%
panel = data.to_azcausal()
model = SDID()
result = model.fit(panel)
print(f"Delta: {100*(TRUE_EFFECT - result.effect.percentage().value / 100): .2f}%")
print(result.summary(title="Synthetic Data Experiment"))