Exp api #25

Merged on Aug 7, 2021 (31 commits)
Commits
63bd474
In middle of refactoring variable API
emjun Jul 20, 2021
eeffddb
Add notes and comments from Rene meeting
emjun Jul 22, 2021
cbfc440
Update gitignore
emjun Jul 22, 2021
bdfb980
Add overview, initial pass of corresponding Tisane program.
emjun Jul 23, 2021
028c2d4
Revise overview notes and ideas
emjun Jul 23, 2021
4c67674
Update test graph tests
emjun Aug 2, 2021
f8a1a6b
Add to examples
emjun Aug 2, 2021
03ba9ae
Remove comments
emjun Aug 2, 2021
64c49d4
Update test case
emjun Aug 2, 2021
fbca408
Add Radon example files
emjun Aug 2, 2021
9246b92
Add comments and notes for next steps
emjun Aug 3, 2021
7c0bc85
Edit example Tisane programs, include comments that translate the lin…
emjun Aug 3, 2021
200b726
Refactoring
emjun Aug 3, 2021
d2dc160
Add basic tikz graph creation (#24)
audreyseo Aug 3, 2021
79fd751
Refactoring API
emjun Aug 4, 2021
b9c45b8
Add SetUp type, update test_variable tests to match new API, debug
emjun Aug 4, 2021
149f383
Writing and re-implementing graph inference algorithms
emjun Aug 4, 2021
29e6a8f
Add text/pseudocode for inference rules and tests
emjun Aug 5, 2021
a82e6ed
Start implementation model effect inference
emjun Aug 5, 2021
4055ccb
Add first pass implementation of main effects generation
emjun Aug 5, 2021
852cedb
Add tests and debug for finding common causal ancestors
emjun Aug 6, 2021
5420a16
Add tests for finding causal ancestors
emjun Aug 6, 2021
4b0af8b
Add more tests for finding variables associated with ivs that also ca…
emjun Aug 6, 2021
54d2870
Add more tests
emjun Aug 6, 2021
5a0f369
Add tests for finding parents that cause an IV and also cause or are …
emjun Aug 6, 2021
c2370a3
Update method name to match rule
emjun Aug 6, 2021
16cedbd
Update infer main effects method
emjun Aug 6, 2021
9852f6b
Add ivs already included in query to main effects candidates
emjun Aug 6, 2021
79b3274
Add comment and return type to function signature for main effects in…
emjun Aug 6, 2021
8a9d71b
Update README for new API
emjun Aug 7, 2021
d7e7d91
Merge branch 'main' into exp_api
emjun Aug 7, 2021
1 change: 1 addition & 0 deletions .gitignore
@@ -129,3 +129,4 @@ dmypy.json

# Pyre type checker
.pyre/
.Rproj.user
94 changes: 68 additions & 26 deletions README.md
@@ -1,28 +1,70 @@
# tisane
Mixed Initiative System for Linear Modeling

Interface + DSL + Knowledge Base

DSL and Knowledge Base:
1. Effects sets generation
- Interaction: Select which sets of variables the end-user is interested in.
2. Dynamically generate the constraints according to the set of variables. (in a for-loop)
- The next step happens in a loop
- For the next step, need a way to store results of vis for anything that is not specific to a set of effects... (can be reused)
3. Property verification for selected sets.
- Interaction: Variable data types
- Interaction:
- If have no data: assumptions that hold (checkboxes or something)
- If have data: Visualization for all residual properties.
4. Steps 1 and 2 generate constraints that are used to figure out the set of statistical models that hold.
5. Parse Knowledge Base output into statistical scripts for each set of effects! (Draco example: https://github.com/uwdata/draco/blob/master/js/src/asp2vl.ts)

Important files:
tests/sample.py : Sample program.

# Design Rationale
Design API
- Definitions: https://docs.google.com/document/d/16b1u8WUQFcYdcswJJqIjKzV11VLG2p7ozFQkYX51JaI/edit
- Early prototype feedback: https://gist.github.com/emjun/97acf6666ed6d4d457efa2edf55eee86
- Formative, related work: https://docs.google.com/document/d/1LGfZ3_WKsyGOeZecA-w-cFqDslFRYdqlKCyo6fgdexM/edit
Tisane: Authoring Statistical Models via Formal Reasoning from Conceptual and Data Relationships

TL;DR: Analysts can use Tisane to author generalized linear models with or without mixed effects. Tisane infers statistical models from variable relationships (from domain knowledge) that analysts specify. By doing so, Tisane helps analysts avoid common threats to external and statistical conclusion validity. Analysts do not need to be statistical experts!

Tisane provides (i) a graph specification language for expressing relationships between variables and (ii) an interactive query and compilation process for inferring a valid statistical model from a set of variables in the graph.

## Graph specification language
### Variables
There are three types of variables: (i) Units, (ii) Measures, and (iii) SetUp, or environmental, variables.
- ``Unit`` types represent entities that are observed (``observed units`` in the experimental design literature) or the recipients of experimental conditions (``experimental units``).
Ex:
- ``Measure`` types represent attributes of units that are proxies of underlying constructs. Measures can have one of the following data types: numeric, nominal, or ordinal. Numeric measures have values that lie on an interval or ratio scale. Nominal measures are categorical variables without an ordering between categories. Ordinal measures are categorical variables with an ordering between categories.
Ex:
- ``SetUp`` types represent study or experimental settings that are global and unrelated to any of the units involved. For example, time is often an environmental variable that differentiates repeated measures but is neither a unit nor a measure.
Ex:

Design rationale: We derived this type system from how other software tools focused on study design separate their concerns.
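
To make the three variable types concrete, here is a minimal sketch in plain Python. The class names mirror the terms above, but the constructors, field names, and the example variables (`student`, `test_score`, `tutoring`, `week`) are assumptions made for illustration and are not taken from Tisane's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """An observed or experimental unit, e.g. a student (hypothetical class)."""
    name: str


@dataclass(frozen=True)
class Measure:
    """An attribute of a unit with a numeric, nominal, or ordinal data type."""
    name: str
    unit: Unit
    dtype: str

    def __post_init__(self):
        # Only the three data types named in the README are accepted.
        if self.dtype not in ("numeric", "nominal", "ordinal"):
            raise ValueError(f"unknown data type: {self.dtype}")


@dataclass(frozen=True)
class SetUp:
    """A global study setting, e.g. time, that belongs to no unit."""
    name: str


# Illustrative variables (assumed, not from the source).
student = Unit("student")
test_score = Measure("test_score", unit=student, dtype="numeric")
tutoring = Measure("tutoring", unit=student, dtype="nominal")
week = SetUp("week")
```

Tying each `Measure` to a `Unit` while leaving `SetUp` free-standing reflects the distinction drawn above: measures are attributes of units, whereas setup variables are global.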

### Relationships between variables
Analysts can use Tisane to express (i) conceptual and (ii) data measurement relationships between variables.

There are three different types of conceptual relationships.
- A variable can *cause* another variable.
- A variable can be *associated with* another variable.
- One or more variables can *moderate* the effect of a variable on another variable.
Currently, a variable, V1, can have a moderated relationship with a variable, V2, without also having a causal or associative relationship with V2. Open question: should this be the case? Initially, I thought that moderation (which is later translated into an interaction effect) is another type of relationship between two variables. At the same time, if V1 has a moderated effect on V2, it is implied that V1 has some kind of associative relationship with V2.

These relationships are used to construct an internal graph representation of variables and their relationships with one another.
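
One way to picture such a graph is the sketch below. It is a simplified stand-in, not Tisane's internal representation: the class `RelationshipGraph`, its method names, and the example variables are all hypothetical. It stores causal edges as directed, associations as symmetric, and moderations as separate records.

```python
from collections import defaultdict


class RelationshipGraph:
    """Hypothetical container for conceptual relationships (not Tisane's class)."""

    def __init__(self):
        self.causal = defaultdict(set)      # cause -> set of effects (directed)
        self.associated = defaultdict(set)  # variable -> associated variables
        self.moderated = []                 # (moderators, moderated variable, outcome)

    def causes(self, cause, effect):
        self.causal[cause].add(effect)

    def associates_with(self, a, b):
        # Association has no direction, so record it both ways.
        self.associated[a].add(b)
        self.associated[b].add(a)

    def moderates(self, moderators, effect_of, on):
        # One or more moderators change the effect of `effect_of` on `on`.
        self.moderated.append((frozenset(moderators), effect_of, on))


graph = RelationshipGraph()
graph.causes("tutoring", "test_score")
graph.associates_with("motivation", "test_score")
graph.moderates(["motivation"], effect_of="tutoring", on="test_score")
```

Keeping moderations apart from causal and associative edges matches the current behavior described above, where a moderated relationship does not require a causal or associative edge between the same variables.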

## Query
Analysts query the relationships they have specified (technically, the internal graph representation) for a statistical model. For each query, analysts must specify (i) a dependent variable to explain using (ii) a set of independent variables.

Query validation: For a query to be valid, the dependent variable must not cause any of the independent variables, and Tisane verifies this. It would be conceptually incorrect to explain a cause from an effect.
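
The validation rule can be sketched as a reachability check over causal edges. This is an illustration under assumptions: the function names and the edge dictionary are hypothetical, and the sketch treats indirect (transitive) causation as invalid too, which the text above does not specify.

```python
def causes_transitively(cause_edges, start, target):
    """Return True if `start` reaches `target` by following causal edges."""
    stack, seen = [start], {start}
    while stack:
        node = stack.pop()
        for effect in cause_edges.get(node, ()):
            if effect == target:
                return True
            if effect not in seen:
                seen.add(effect)
                stack.append(effect)
    return False


def validate_query(cause_edges, dv, ivs):
    """Raise ValueError if the dependent variable causes any independent variable."""
    offending = [iv for iv in ivs if causes_transitively(cause_edges, dv, iv)]
    if offending:
        raise ValueError(
            f"{dv} causes {offending}; a cause cannot be explained from its effect"
        )


# Assumed example: tutoring -> test_score -> graduation
cause_edges = {"tutoring": {"test_score"}, "test_score": {"graduation"}}
validate_query(cause_edges, dv="graduation", ivs=["tutoring", "test_score"])
```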

## Statistical model inference
After validating a query, Tisane traverses the internal graph representation in order to generate candidate generalized linear models with or without mixed effects. A generalized linear model consists of a model effects structure and a family/link function pair.

### Model effects structure
<!-- generate possible statistical model effects structures and family/link functions. -->
Tisane generates candidate main effects, interaction effects, and, if applicable, random effects based on analysts' expressed relationships.

- Tisane aims to direct analysts' attention to variables, especially possible confounders, that the analyst may have overlooked. When generating main effects candidates, Tisane looks for other variables in the graph that may exert causal influence on the dependent variable and are related to the input independent variables.
- Tisane aims to represent conceptual relationships between variables accurately. Based on the main effects analysts choose to include in their output statistical model, Tisane suggests interaction effects to include. Tisane relies on the moderate relationships analysts specified in their input program to infer interaction effects.
- Tisane aims to increase the generalizability of statistical analyses and results by automatically detecting the need for and including random effects. Tisane follows the guidelines outlined in [] and [] to generate the maximal random effects structure.

[INFERENCE.md](tisane/INFERENCE.md) explains all inference rules in greater detail.
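
As a rough illustration of the main effects idea only, the sketch below collects variables that cause the dependent variable and are also related to at least one independent variable. It is a deliberate simplification with hypothetical names and data, it considers only direct edges, and it does not reproduce the rules in INFERENCE.md.

```python
def main_effect_candidates(causes, associations, dv, ivs):
    """Return the IVs plus variables that cause the DV and relate to some IV.

    `causes` maps a variable to the set of variables it causes;
    `associations` maps a variable to the set of variables it is associated with.
    """
    candidates = set(ivs)  # IVs already in the query are always candidates
    for var, effects in causes.items():
        if var == dv or var in candidates:
            continue
        if dv not in effects:
            continue  # no direct causal influence on the DV
        related_to_iv = any(
            iv in effects or iv in associations.get(var, ()) for iv in ivs
        )
        if related_to_iv:
            candidates.add(var)
    return candidates


# Assumed example: motivation causes both tutoring and test_score,
# so it is a possible confounder the analyst may have overlooked.
causes = {
    "motivation": {"tutoring", "test_score"},
    "tutoring": {"test_score"},
    "shoe_size": {"test_score"},
}
result = main_effect_candidates(causes, {}, dv="test_score", ivs=["tutoring"])
```

In this example `shoe_size` is left out because, although it causes the dependent variable, it has no relationship with the queried independent variable.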

### Family/link function
Family and link functions depend on the data types of dependent variables and their distributions.

Based on the data type of the dependent variable, Tisane suggests matched pairs of possible family and link functions to consider. Tisane ensures that analysts consider only valid pairs of family and link functions.
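
The matching of data types to family/link pairs can be pictured as a lookup table. The table below is an assumption made for illustration: it lists a few standard GLM pairings (each family with its canonical link plus one common alternative), covers only numeric and binary nominal outcomes, and is not the list Tisane actually uses.

```python
# Illustrative, incomplete table (an assumption, not Tisane's own list).
CANDIDATE_PAIRS = {
    "numeric": [
        ("Gaussian", "identity"),
        ("Gaussian", "log"),
        ("Gamma", "inverse"),
        ("Gamma", "log"),
        ("Poisson", "log"),
    ],
    "nominal": [  # binary outcomes only in this sketch
        ("Binomial", "logit"),
        ("Binomial", "probit"),
    ],
}


def candidate_family_links(dtype):
    """Return the family/link pairs offered for a dependent variable's data type."""
    if dtype not in CANDIDATE_PAIRS:
        raise ValueError(f"no pairs listed for data type: {dtype}")
    return list(CANDIDATE_PAIRS[dtype])


def is_valid_choice(dtype, family, link):
    """Check that a chosen pair is among those offered for the data type."""
    return (family, link) in CANDIDATE_PAIRS.get(dtype, [])
```

Restricting choices to entries in the table is one simple way to ensure that only matched pairs are ever considered.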

## Interaction model
A key aspect of Tisane that distinguishes it from other systems, such as [Tea](), is the importance of user interaction in guiding the statistical model that is inferred as output and ultimately fit.

Tisane generates a space of candidate statistical models and asks analysts disambiguation questions for (i) including additional main or interaction effects and, if applicable, correlating (or uncorrelating) random slopes and random intercepts as well as (ii) selecting among viable family/link function pairs.

To help analysts, Tisane provides text explanations and visualizations. For example, to show possible family functions, Tisane simulates data to fit a family function, visualizes it on top of a histogram of the analyst's data, and explains to the analyst how to use the visualization to compare family functions.

Under consideration: Tisane could also provide the results of statistical tests (e.g., Shapiro-Wilk) that check whether the underlying data follow a family distribution.

<!-- Two aspects:
- generating the space
- narrowing the space -->

### Tisane helps analysts avoid common threats to external and statistical conclusion validity.
TODO: Specifically, Tisane helps....

Tisane elicits conceptual and data measurement relationships between variables. Tisane considers and asks users during disambiguation only about conceptually plausible effects based on variable relationships, and plausible family/link functions based on variable data types. This means that all candidate models are conceptually justifiable. Tisane also does not show model results during disambiguation, preventing users from providing answers that lead to desired findings and p-values. As a result, Tisane discourages cherry-picking (e.g., p-hacking). It would likely be easier to “p-hack” a linear model using existing tools because to “p-hack” Tisane, a user would have to manipulate their conceptual model and/or the underlying model generation process. Locking (see below) is another way to discourage cherry-picking.