Merge pull request #40 from georgianpartners/issue_33

Implement basic integration tests
georgian-io-archive · Feb 20, 2019 · 6896f1d
2 parents 7f5ce68 + 787a063
commit 6896f1d
Showing 5 changed files with 117 additions and 8 deletions.
diff --git a/.travis.yml b/.travis.yml
@@ -24,6 +24,9 @@ script:
 after_success:
   - poetry run coveralls
 
+env:
+  - FORESHADOW_TESTS="ALL"
+
 jobs:
   include:
     - python: "3.5"
@@ -35,4 +38,4 @@ jobs:
         - pip install pre-commit
         - pre-commit install-hooks
       script:
-        - pre-commit run --all-files
+        - pre-commit run --all-files
diff --git a/doc/developers.rst b/doc/developers.rst
@@ -109,6 +109,8 @@ Making sure everything works
    
    If all the tests pass you're all set up!
 
+.. note:: Our platform also includes integration tests that assess the overall performance of our framework using the default settings on a few standard ML datasets. These tests are not executed by default. To run them, set an environment variable called ``FORESHADOW_TESTS`` to ``ALL``.
+
 Suggested development work flow
    1. Create a branch off of development to contain your change
 
@@ -199,10 +201,14 @@ Intents are where the magic of Foreshadow all comes together. You need to be tho
 
 You will need to set the :py:attr:`dtype <foreshadow.intents.BaseIntent.dtype>`, :py:attr:`children <foreshadow.intents.BaseIntent.children>`, :py:attr:`single_pipeline <foreshadow.intents.BaseIntent.single_pipeline>`, and :py:attr:`multi_pipeline <foreshadow.intents.BaseIntent.multi_pipeline>` class attributes. You will also need to implement the :py:meth:`is_intent <foreshadow.intents.BaseIntent.is_intent>` classmethod. In most cases when adding an intent you can initialize :py:attr:`children <foreshadow.intents.BaseIntent.children>` to an empty list. Set the :py:attr:`dtype <foreshadow.intents.BaseIntent.dtype>` to the most appropriate initial form of the data entering your intent.
 
-Use the :py:attr:`single_pipeline <foreshadow.intents.BaseIntent.single_pipeline>` field to determine the transformers that will be applied to a **single** column that is mapped to your intent. Add a **unique** name describing each step that you choose to include in your pipeline. It is important to note the utility of smart transformers here as you can now include branched logic in your pipelines deciding between different individual transformers based on the input data at runtime. The :py:attr:`multi_pipeline <foreshadow.intents.BaseIntent.multi_pipeline>` pipeline should be used to apply transformations to all columns of a specific  intent after the single pipelines have been evaluated. The same rules for defining the pipelines themselves apply here as well.
+Use the :py:attr:`single_pipeline <foreshadow.intents.BaseIntent.single_pipeline>` field to determine the transformers that will be applied to a **single** column that is mapped to your intent. Add a **unique** name describing each step that you choose to include in your pipeline. This field is represented as a list of PipelineTemplateEntry objects, which are constructed using the following format: ``PipelineTemplateEntry([unique_name], [class], [can_operate_on_y])``. The class entry is either a single transformer class or a tuple of the form ``([cls], {**args})``, where args will be passed into the constructor of the transformer. The final boolean determines whether that transformer should be applied when operating on y-variables.
+
+It is important to note the utility of smart transformers here, as you can now include branched logic in your pipelines, deciding between different individual transformers based on the input data at runtime. The :py:attr:`multi_pipeline <foreshadow.intents.BaseIntent.multi_pipeline>` pipeline should be used to apply transformations to all columns of a specific intent after the single pipelines have been evaluated. The same rules for defining the pipelines themselves apply here as well.
 
 The :py:meth:`is_intent <foreshadow.intents.BaseIntent.is_intent>` classmethod determines whether a specific column maps to an intent. Use this method to apply any heuristics, logic, or methods of determining whether a raw column maps to the intent that you are defining. Below is an example intent definition that you can modify to suit your needs.
 
+The :py:meth:`column_summary <foreshadow.intents.BaseIntent.column_summary>` classmethod is used to generate statistical reports each time an intent operates on a column, allowing a user to examine how effective the intent will be in processing the data. These reports can be accessed by calling the :py:meth:`summarize <foreshadow.preprocessor.summarize>` method after fitting the Foreshadow object.
+
 Make **sure** to go to the parent intent and add your intent class name to the ordered :py:attr:`children <foreshadow.intents.BaseIntent.children>` field in the order of priority among the previously defined intents. The last intent in this list will be the most preferred intent upon evaluation in the case of multiple intents being able to process a column.
 
 Take a look at the :py:class:`NumericIntent <foreshadow.intents.NumericIntent>` implementation for an example of how to implement an intent.
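
The ``PipelineTemplateEntry`` format described above can be sketched roughly as follows. This is a minimal illustration, not Foreshadow's implementation: the namedtuple stand-in, its field names, the placeholder transformer classes, and the ``build_steps`` helper are all hypothetical and exist only to show how the three positional values and the ``([cls], {**args})`` tuple form might be interpreted.

```python
from collections import namedtuple

# Hypothetical stand-in for foreshadow's PipelineTemplateEntry; the field
# names are assumptions made for this illustration only.
PipelineTemplateEntry = namedtuple(
    "PipelineTemplateEntry", ["name", "transformer", "can_operate_on_y"]
)


class Transformer1:
    """Placeholder transformer that takes no constructor arguments."""


class Transformer2:
    """Placeholder transformer that takes one constructor argument."""

    def __init__(self, arg1=False):
        self.arg1 = arg1


single_pipeline_template = [
    PipelineTemplateEntry("t1", Transformer1, False),
    PipelineTemplateEntry("t2", (Transformer2, {"arg1": True}), True),
    PipelineTemplateEntry("t3", Transformer1, True),
]


def build_steps(template, y_var=False):
    """Instantiate each entry; when y_var is True, keep only y-capable entries."""
    steps = []
    for entry in template:
        if y_var and not entry.can_operate_on_y:
            continue
        if isinstance(entry.transformer, tuple):
            # Tuple form: (class, constructor keyword arguments).
            cls, kwargs = entry.transformer
        else:
            cls, kwargs = entry.transformer, {}
        steps.append((entry.name, cls(**kwargs)))
    return steps
```

Under these assumptions, ``build_steps(single_pipeline_template)`` yields all three named steps, while ``build_steps(single_pipeline_template, y_var=True)`` drops ``t1`` because its final boolean is ``False``.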
@@ -211,4 +217,4 @@ Take a look at the :py:class:`NumericIntent <foreshadow.intents.NumericIntent>`
 Future Architecture Roadmap
 ---------------------------
 
-Under progress
+In progress
diff --git a/doc/users.rst b/doc/users.rst
@@ -336,6 +336,8 @@ other than the :code:`override` parameter itself will be passed to the override
 To use a smart transformer outside of the Intent / Foreshadow environment, simply use it exactly as a sklearn transformer. When you call :code:`fit()` or :code:`fit_transform()` it automatically
 resolves which transformer to use by internally calling the overridden :code:`_get_transformer()` method.
 
+.. note:: Arguments passed into the constructor of a smart transformer will be passed into the fit function of the transformer it resolves to. This is primarily meant to be used alongside the :code:`override` argument.
+
 
 Configuration
 -------------

diff --git a/foreshadow/intents/base.py b/foreshadow/intents/base.py
@@ -67,16 +67,17 @@ class BaseIntent(metaclass=_IntentRegistry):
 
     single_pipeline_template = None
     """A template for single pipelines of smart transformers that affect a 
-        single column in an intent
+        single column in an intent. Uses a list of PipelineTemplateEntry to
+        describe the transformers.
 
-        The template needs an additional boolean at the end of the tuple that
+        The template needs an additional boolean at the end of the constructor that
         determines whether the transformation can be applied to response 
         variables.
     
         Example: single_pipeline_template = [
-            ('t1', Transformer1, False),
-            ('t2', (Transformer2, {'arg1': True}), True),
-            ('t3', Transformer1, True),
+            PipelineTemplateEntry('t1', Transformer1, False),
+            PipelineTemplateEntry('t2', (Transformer2, {'arg1': True}), True),
+            PipelineTemplateEntry('t3', Transformer1, True),
         ]
     """
 

diff --git a/foreshadow/tests/test_integration.py b/foreshadow/tests/test_integration.py
@@ -0,0 +1,97 @@
+"""
+Integration Tests
+
+Slow-running tests that verify the performance of the framework on simple datasets
+"""
+
+import pytest
+
+
+def check_slow():
+    import os
+
+    return os.environ.get("FORESHADOW_TESTS") != "ALL"
+
+
+slow = pytest.mark.skipif(
+    check_slow(), reason="Skipping long-running integration tests"
+)
+
+
+@slow
+def test_integration_binary_classification():
+    import foreshadow as fs
+    import pandas as pd
+    import numpy as np
+    from sklearn.datasets import load_breast_cancer
+    from sklearn.model_selection import train_test_split
+    from sklearn.linear_model import LogisticRegression
+
+    np.random.seed(1337)
+
+    cancer = load_breast_cancer()
+    cancerX_df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
+    cancery_df = pd.DataFrame(cancer.target, columns=["target"])
+
+    X_train, X_test, y_train, y_test = train_test_split(
+        cancerX_df, cancery_df, test_size=0.2
+    )
+    shadow = fs.Foreshadow(estimator=LogisticRegression())
+    shadow.fit(X_train, y_train)
+
+    baseline = 0.9824561403508771
+    score = shadow.score(X_test, y_test)
+
+    assert not score < baseline * 0.9
+
+
+@slow
+def test_integration_multiclass_classification():
+    import foreshadow as fs
+    import numpy as np
+    import pandas as pd
+    from sklearn.datasets import load_iris
+    from sklearn.model_selection import train_test_split
+    from sklearn.linear_model import LogisticRegression
+
+    np.random.seed(1337)
+
+    iris = load_iris()
+    irisX_df = pd.DataFrame(iris.data, columns=iris.feature_names)
+    irisy_df = pd.DataFrame(iris.target, columns=["target"])
+
+    X_train, X_test, y_train, y_test = train_test_split(
+        irisX_df, irisy_df, test_size=0.2
+    )
+    shadow = fs.Foreshadow(estimator=LogisticRegression())
+    shadow.fit(X_train, y_train)
+
+    baseline = 0.9666666666666667
+    score = shadow.score(X_test, y_test)
+
+    assert not score < baseline * 0.9
+
+
+@slow
+def test_integration_regression():
+    import foreshadow as fs
+    import numpy as np
+    import pandas as pd
+    from sklearn.datasets import load_boston
+    from sklearn.model_selection import train_test_split
+    from sklearn.linear_model import LinearRegression
+
+    boston = load_boston()
+    bostonX_df = pd.DataFrame(boston.data, columns=boston.feature_names)
+    bostony_df = pd.DataFrame(boston.target, columns=["target"])
+
+    X_train, X_test, y_train, y_test = train_test_split(
+        bostonX_df, bostony_df, test_size=0.2
+    )
+    shadow = fs.Foreshadow(estimator=LinearRegression())
+    shadow.fit(X_train, y_train)
+
+    baseline = 0.6953024611269096
+    score = shadow.score(X_test, y_test)
+
+    assert not score < baseline * 0.9
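
The gating and pass criterion used by the tests above can be sketched in isolation. This is a rough illustration under stated assumptions: `should_skip` and `within_tolerance` are hypothetical helper names that do not appear in the diff, and the `environ` parameter is added only so the logic can be exercised without touching the real process environment.

```python
import os


def should_skip(environ=None):
    """Return True unless FORESHADOW_TESTS is set to exactly "ALL"."""
    # Fall back to the real process environment when no mapping is given.
    environ = os.environ if environ is None else environ
    return environ.get("FORESHADOW_TESTS") != "ALL"


def within_tolerance(score, baseline, tolerance=0.9):
    """Mirror `assert not score < baseline * 0.9` as a boolean check."""
    return not score < baseline * tolerance


# An unset variable means the integration tests are skipped.
print(should_skip({}))
# Setting the variable to "ALL" enables them.
print(should_skip({"FORESHADOW_TESTS": "ALL"}))
# A score within 90% of the recorded baseline passes.
print(within_tolerance(0.95, 0.9824561403508771))
```

Assuming pytest is invoked directly from a POSIX shell, the tests could then be enabled with something like `FORESHADOW_TESTS=ALL pytest foreshadow/tests/test_integration.py`.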