
Updated README and debugging CI

emjun committed Apr 16, 2019
1 parent 3cf07f1 commit 66459ad17b68ac08435508ee4307f681509a770b
Showing with 97 additions and 56 deletions.
  1. +20 −0 CONTRIBUTING.md
  2. +9 −9 Pipfile.lock
  3. +65 −28 README.md
  4. +0 −2 tea/api.py
  5. +0 −1 tea/build.py
  6. +2 −3 tea/evaluate_helper_methods.py
  7. +1 −12 tests/integration_tests/test_integration.py
  8. +0 −1 tests/unit_tests/test_frontend.py
@@ -0,0 +1,20 @@
## FILES

## How do I use Tea?
Make sure that you have Python 3.7, pip (for Python 3.7), and pipenv (for Python 3.7) installed.
Start a pipenv shell: `pipenv shell`
From inside your environment, download all dependencies from the Pipfile (`pipenv update`). This will take a while because it builds Z3.
Add Tea to your Python path by creating a `.env` file that contains the following one-liner: `PYTHONPATH=${PYTHONPATH}:${PWD}`
To run tests and see output, run: `pytest tests/integration_tests/test_integration.py -s`

The main code base is written in Python and lives in the `tea` directory. The `tests` directory is used for developing and debugging and uses datasets in the `datasets` directory. Not all the datasets used in `tests/test_tea.py` are included in the `datasets` repository.
`tea/solver.py` contains the constraint solving module for both tests -> properties and properties -> tests.
`tea/ast.py` implements Tea's Abstract Syntax Tree (AST).
`tea/build.py` builds up Tea's AST for programs.
`tea/dataset.py` contains a runtime data structure that represents and contains the data the user provides.
`tea/evaluate.py` is the main interpreter file.
`tea/evaluate_helper_methods.py` contains the helper methods for `evaluate.py`.
`tea/evaluate_data_structures.py` contains the data structures used in `evaluate.py` and `evaluate_helper_methods.py`.
`tea/errors.py` is empty. It will contain some code for providing helpful error messages.

**All of the above are still changing!**
@@ -209,18 +209,18 @@
},
"pyparsing": {
"hashes": [
"sha256:66c9268862641abcac4a96ba74506e594c884e3f57690a696d21ad8210ed667a",
"sha256:f6c5ef0d7480ad048c054c37632c67fca55299990fff127850181659eea33fc3"
"sha256:1873c03321fc118f4e9746baf201ff990ceb915f433f23b395f5580d1840cb2a",
"sha256:9b6323ef4ab914af344ba97510e966d64ba91055d6b9afa6b30799340e89cc03"
],
"version": "==2.3.1"
"version": "==2.4.0"
},
"pytest": {
"hashes": [
"sha256:13c5e9fb5ec5179995e9357111ab089af350d788cbc944c628f3cde72285809b",
"sha256:f21d2f1fb8200830dcbb5d8ec466a9c9120e20d8b53c7585d180125cce1d297a"
"sha256:3773f4c235918987d51daf1db66d51c99fac654c81d6f2f709a046ab446d5e5d",
"sha256:b7802283b70ca24d7119b32915efa7c409982f59913c1a6c0640aacf118b95f5"
],
"index": "pypi",
"version": "==4.4.0"
"version": "==4.4.1"
},
"python-dateutil": {
"hashes": [
@@ -231,10 +231,10 @@
},
"pytz": {
"hashes": [
"sha256:32b0891edff07e28efe91284ed9c31e123d84bea3fd98e1f72be2508f43ef8d9",
"sha256:d5f05e487007e29e03409f9398d074e158d920d36eb82eaf66fb1136b0c5374c"
"sha256:303879e36b721603cc54604edcac9d20401bdbe31e1e4fdee5b9f98d5d31dfda",
"sha256:d747dd3d23d77ef44c6a3526e274af6efeb0a6f1afd5a69ba4d5be4098c8e141"
],
"version": "==2018.9"
"version": "==2019.1"
},
"requests": {
"hashes": [
@@ -3,31 +3,68 @@
# [WIP] Tea: A High-level Language and Runtime System for Automating Statistical Analyses

## What is Tea?
Tea is a domain-specific language for expressing the assertions and intentions/goals of statistical analyses (e.g., to compare two groups on an outcome). The user provides a dataset (currently only CSV) and an experimental design specification. The Tea runtime system then translates these high-level expressions, calculates properties about the data, and translates these properties into constraints to find a set of valid statistical tests. Tea uses Z3 as its constraint solver.

## How do I use Tea?
Make sure that you have Python 3.7, pip (for Python 3.7), and pipenv (for Python 3.7) installed.
Start a pipenv shell: `pipenv shell`
From inside your environment, download all dependencies from the Pipfile (`pipenv update`). This will take a while because it builds Z3.
Add Tea to your Python path by creating a `.env` file that contains the following one-liner: `PYTHONPATH=${PYTHONPATH}:${PWD}`
To run tests and see output, run: `pytest tests/integration_tests/test_integration.py -s`
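The steps above can be run as a short session from a fresh clone (a sketch, assuming Python 3.7 and pipenv are already installed):

```shell
# Setup sketch for a fresh clone of the repository.
pipenv shell                                   # start the virtual environment
pipenv update                                  # install dependencies from Pipfile (builds Z3, so this is slow)
echo 'PYTHONPATH=${PYTHONPATH}:${PWD}' > .env  # add Tea to the Python path
pytest tests/integration_tests/test_integration.py -s  # run integration tests with output
```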

The main code base is written in Python and lives in the `tea` directory. The `tests` directory is used for developing and debugging and uses datasets in the `datasets` directory. Not all the datasets used in `tests/test_tea.py` are included in the `datasets` repository.
`tea/solver.py` contains the constraint solving module for both tests -> properties and properties -> tests.
`tea/ast.py` implements Tea's Abstract Syntax Tree (AST).
`tea/build.py` builds up Tea's AST for programs.
`tea/dataset.py` contains a runtime data structure that represents and contains the data the user provides.
`tea/evaluate.py` is the main interpreter file.
`tea/evaluate_helper_methods.py` contains the helper methods for `evaluate.py`.
`tea/evaluate_data_structures.py` contains the data structures used in `evaluate.py` and `evaluate_helper_methods.py`.
`tea/errors.py` is empty. It will contain some code for providing helpful error messages.

**All of the above are still changing!**

## Why Python?
Many comparable data science tools are implemented in Python. People seem to like and use Python a lot for data work!

## How can I contribute?
You want to contribute? That's so great! Thank you so much!

It would be best to create a separate branch that we can merge when Tea is slightly more stable. :)
Tea is a domain-specific programming language that automates statistical test
selection and execution. Tea is currently written in/for Python.

Users provide 5 pieces of information:
* the *dataset* of interest,
* the *variables* in the dataset they want to analyze,
* the *study design* (e.g., independent, dependent variables),
* the *assumptions* they make about the data based on domain knowledge (e.g.,
a variable is normally distributed), and
* a *hypothesis*.

Tea then "compiles" these into logical constraints to select valid
statistical tests. Tests are considered valid if and only if *all* the
assumptions they make about the data (e.g., normal distribution, equal
variance between groups, etc.) hold. Tea then finally executes the valid tests.
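The selection rule above can be sketched in plain Python (a hypothetical illustration only, not Tea's actual Z3-based solver; the test names and property names here are simplified for the example):

```python
# Hypothetical sketch of "a test is valid iff all of its assumptions hold".
# Tea's real solver encodes this as logical constraints for Z3; plain sets
# suffice to illustrate the idea.
TEST_ASSUMPTIONS = {
    "students_t":    {"normal", "equal_variance", "independent_obs"},
    "welchs_t":      {"normal", "independent_obs"},
    "mannwhitney_u": {"independent_obs"},
}

def valid_tests(properties_that_hold: set) -> list:
    """Return every test whose full assumption set is satisfied by the data."""
    return [name for name, required in TEST_ASSUMPTIONS.items()
            if required <= properties_that_hold]

# Data is normal and observations are independent, but variances are unequal:
print(valid_tests({"normal", "independent_obs"}))
# → ['welchs_t', 'mannwhitney_u']
```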

## What kinds of statistical analyses are possible with Tea?
Tea currently provides a module to conduct Null Hypothesis Significance
Testing (NHST).

*We are actively working on expanding the kinds of analyses Tea can support. Some ideas we have: Bayesian inference and linear modeling.*

## How can I use Tea?
Tea will **very soon** be available on pip! Check back for updates :)

## How can I cite Tea?
For now, please cite our arXiv paper:
```
@article{JunEtAl2019:Tea,
title={Tea: A High-level Language and Runtime System for Automating Statistical Analysis},
author={Jun, Eunice and Daum, Maureen and Roesch, Jared and Chasins, Sarah E. and Berger, Emery D. and Just, Rene and Reinecke, Katharina},
journal={arXiv preprint arXiv:1904.05387},
year={2019},
publisher={Cornell University}
}
```

## How reliable is Tea?
Tea is currently a research prototype. Our constraint solver is based on
statistical texts (see <a href='https://arxiv.org/pdf/1904.05387.pdf'>our paper</a> for more info).

If you find any bugs, please let us know (email Eunice at emjun [at] cs.washington.edu)!

## I want to collaborate! Where do I begin?
This is great! We're excited to have new collaborators. :)

To contribute *code*, please see the <a href='./CONTRIBUTING.md'>docs and
guidelines</a> and open an issue or pull request.

If you want to use Tea for a project, talk about Tea's design, or anything else, please get in touch: emjun at cs.washington.edu.

## Where can I learn more about Tea?
Please find more information at <a href='https://www.tea-lang.org'>our website</a>.

## I have ideas. I want to chat.
Please reach out! We are nice :): emjun at cs.washington.edu


### By the way, why Python?
Python is a common language for data science. We hope Tea can easily integrate
into user workflows.

*We are working on compiling Tea programs to different target languages, including R.*
@@ -132,7 +132,6 @@ def hypothesize(vars: list, prediction: str=None):
# Interpret AST node, Returns ResData object (?)
result = evaluate(dataset_obj, relationship, assumptions, study_design)
# all_results[relationship] -- How to check for multiple comparison problem?
# import pdb; pdb.set_trace()
print(f"\n{result}")
return result

@@ -190,7 +189,6 @@ def divine_properties(vars:list, tests:list):

# print("\nProperties for student's t test and Mann Whitney u test are complementary.")
print("\nProperties:")
import pdb; pdb.set_trace()
pp.pprint(test_to_properties)
print("\nProperties that could not be satisfied:")
pp.pprint(test_to_broken_properties)
@@ -305,7 +305,6 @@ def predict(vars: list, predictions: list=None):

# Prediction pertains to neither categorical nor numerical data
else:
import pdb; pdb.set_trace()
raise ValueError(f"Prediction is malformed: {p}")

formulated_predictions.append(pred)
@@ -699,7 +699,6 @@ def rm_one_way_anova(dataset: Dataset, design, combined_data: CombinedData):
# import pdb; pdb.set_trace()
aovrm2way = AnovaRM(data, depvar=y.metadata[name], subject=key, within=within_subjs, aggregate_func='mean')
# aovrm2way = AnovaRM(data, depvar=y.metadata[name], subject=dataset.pid_col_name, within=within_subjs, between=between_subjs) # apparently not implemented in statsmodels
import pdb; pdb.set_trace()
res2way = aovrm2way.fit()
return res2way

@@ -713,7 +712,7 @@ def kruskall_wallis(dataset: Dataset, predictions, combined_data: CombinedData):
data = []
for x in xs:
if x.metadata[categories] is None:
import pdb; pdb.set_trace()
raise ValueError('')
cat = [k for k,v in x.metadata[categories].items()]
for c in cat:
cat_data = dataset.select(y.metadata[name], where=[f"{x.metadata[name]} == '{c}'"])
@@ -740,7 +739,7 @@ def friedman(dataset: Dataset, predictions, combined_data: CombinedData):
# https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.linregress.html
# Parameters: x (array-like) | y (array-like)
def linear_regression(iv: VarData, dv: VarData, predictions: list, comp_data: CombinedData, **kwargs):
import pdb; pdb.set_trace()
# import pdb; pdb.set_trace()
return stats.linregress(iv.dataframe, dv.dataframe)

# def bootstrap(data):
@@ -227,7 +227,7 @@ def test_kendall_tau_corr():
print("\nfrom Field et al.")
print("Expected outcome: Kendall Tau")
log("Kendall Tau", results)
import pdb; pdb.set_trace()
# import pdb; pdb.set_trace()
# Returns: Kendall Tau and Pearson

def test_pointbiserial_corr():
@@ -270,7 +270,6 @@ def test_pointbiserial_corr():
tea.hypothesize(['time', 'gender'], ['gender:1 > 0']) # I think this works!?
print("\nfrom Field et al.")
print("Expected outcome: Pointbiserial")
import pdb; pdb.set_trace()
# Results: {'mannwhitney_u': MannwhitneyuResult(statistic=262.0,
# pvalue=0.0058742825311290285), 'kruskall_wallis':
# KruskalResult(statistic=7.629432016171138, pvalue=0.00574233835210006)}
@@ -316,7 +315,6 @@ def test_indep_t_test():
tea.hypothesize(['So', 'Prob'], ['So:1 > 0']) ## Southern is greater
print("\nfrom Kabacoff")
print("Expected outcome: Student's t-test")
import pdb; pdb.set_trace()
# Results: {'students_t': Ttest_indResult(statistic=-4.202130736875173, pvalue=0.00012364897266532775), 'welchs_t': Ttest_indResult(statistic=-3.8953717090736655, pvalue=0.0006505783178002014), 'mannwhitney_u': MannwhitneyuResult(statistic=81.0, pvalue=0.00018546387565891538), 'f_test': df sum_sq mean_sq F PR(>F)
# C(So) 1.0 0.006702 0.006702 17.657903 0.000124
# Residual 45.0 0.017079 0.000380 NaN NaN, 'kruskall_wallis': KruskalResult(statistic=14.056955645161281, pvalue=0.00017735665596242664), 'factorial_ANOVA': df sum_sq mean_sq F PR(>F)
@@ -366,7 +364,6 @@ def test_paired_t_test():

print("\nfrom Field et al.")
print("Expected outcome: Paired/Dependent t-test")
import pdb; pdb.set_trace()
# Results: {'pointbiserial_corr_a': True, 'paired_students_t': Ttest_relResult(statistic=-2.472533427497901, pvalue=0.030981783136040896), 'wilcoxon_signed_rank': WilcoxonResult(statistic=8.0, pvalue=0.045855524379089546), 'rm_one_way_anova': True, 'factorial_ANOVA': df sum_sq mean_sq F PR(>F)
# C(Group) 1.0 294.0 294.0 2.826923 0.106839
# Residual 22.0 2288.0 104.0 NaN NaN}
@@ -416,8 +413,6 @@ def test_wilcoxon_signed_rank():

print("\nfrom Field et al.")
print("Expected outcome: Wilcoxon signed rank test")
import pdb; pdb.set_trace()
# Results: {'pointbiserial_corr_a': True, 'wilcoxon_signed_rank': WilcoxonResult(statistic=8.0, pvalue=0.04656776138686537)}

def test_f_test():
print("\nFrom Field et al.")
@@ -457,7 +452,6 @@ def test_f_test():
tea.hypothesize(['trt', 'response'])
print("\nFrom Field et al.")
print("Expected outcome: Oneway ANOVA (F) test")
import pdb; pdb.set_trace()

def test_kruskall_wallis():
print("\nFrom Field et al.")
@@ -498,7 +492,6 @@ def test_kruskall_wallis():

print("\nFrom Field et al.")
print("Expected outcome: Kruskall Wallis")
import pdb; pdb.set_trace()

def test_rm_one_way_anova():
print("\nFrom Field et al.")
@@ -543,7 +536,6 @@ def test_rm_one_way_anova():

print("\nFrom Field et al.")
print("Expected outcome: Repeated Measures One Way ANOVA")
import pdb; pdb.set_trace()

def test_factorial_anova():
print("\nFrom Field et al.")
@@ -587,7 +579,6 @@ def test_factorial_anova():
# alcohol main effect?
print("\nFrom Field et al.")
print("Expected outcome: Factorial ANOVA")
import pdb; pdb.set_trace()
"""
def test_factorial_anova_2():
print("\nFrom Field et al.")
@@ -678,7 +669,6 @@ def test_two_way_anova():
tea.hypothesize(['uptake', 'conc', 'Type']) # Fails: not all groups are normal
#Type main effect?
print('Supposed to be 2 way ANOVA')
import pdb; pdb.set_trace()



@@ -714,4 +704,3 @@ def test_chi_square():

tea.hypothesize(['Training', 'Dance'])
print('Chi square')
import pdb; pdb.set_trace()
@@ -246,7 +246,6 @@ def test_index_in_dataset():
url = 'https://homes.cs.washington.edu/~emjun/tea-lang/datasets/mtcars.csv'
url_mini = '/Users/emjun/Git/tea-lang/datasets/mini_test.csv'
ds = load_data_from_url(url_mini, 'mini_test')
import pdb; pdb.set_trace()
ds = load_data(file_path, variables, 'participant_id')

# Bivariate test, between subjects
