This repository has been archived by the owner on Jan 9, 2024. It is now read-only.

Commit

Merge pull request #22 from georgianpartners/issue_10
Implemented configuration change resolving issue #10
alexrallen committed Jan 4, 2019
2 parents 33eee1c + c3e80a6 commit be7d368
Showing 19 changed files with 287 additions and 161 deletions.
105 changes: 56 additions & 49 deletions doc/users.rst
@@ -351,23 +351,25 @@ An example configuration for processing the Boston Housing dataset is below. We
{
"columns":{
"crim":["GenericIntent",
[
["StandardScaler", "Scaler", {"with_mean":false}]
]],
"indus":["GenericIntent"]
"crim":{"intent": "GenericIntent",
"pipeline": [
{"transformer": "StandardScaler", "name": "Scaler", "parameters": {"with_mean":false}}
]},
"indus":{"intent": "GenericIntent"}
},
"postprocess":[
["pca",["age"],[
["PCA", "PCA", {"n_components":2}]
]]
{"name":"pca",
"columns": ["age"],
"pipeline": [
{"transformer": "PCA", "name": "PCA", "parameters": {"n_components":2}}
]}
],
"intents":{
"NumericIntent":{
"single":[
["Imputer", "impute", {"strategy":"mean"}]
{"transformer": "Imputer", "name": "impute", "parameters": {"strategy":"mean"}}
],
"multi":[]
}
@@ -383,32 +385,33 @@ Column Override

.. code-block:: json
{"columns":{
"crim":["GenericIntent",
[
["StandardScaler", "Scaler", {"with_mean":false}]
]],
"indus":["GenericIntent"]
}}
"columns":{
"crim":{"intent": "GenericIntent",
"pipeline": [
{"transformer": "StandardScaler", "name": "Scaler", "parameters": {"with_mean":false}}
]},
"indus":{"intent": "GenericIntent"}
}
This section is a dictionary containing two keys, each of which is a column in the Boston Housing set. First we will look at the value
of the :code:`"crim"` key which is a list.
of the :code:`"crim"` key which is a dict.


.. code-block:: json
{"intent": "GenericIntent",
"pipeline": [
{"transformer": "StandardScaler", "name": "Scaler", "parameters": {"with_mean":false}}
]}
["GenericIntent",[
["StandardScaler", "Scaler", {"with_mean":false}]
]]
The list is of form :code:`[intent_name, pipeline]`. Here we can see that this column has been assigned the intent :code:`"GenericIntent"`
and the pipeline :code:`[["StandardScaler", "Scaler", {"with_mean":false}]]`.
Here we can see that this column has been assigned the intent :code:`"GenericIntent"`
and the pipeline :code:`[{"transformer": "StandardScaler", "name": "Scaler", "parameters": {"with_mean":false}}]`.

This means that regardless of how Preprocessor automatically assigns Intents, the intent GenericIntent will always be assigned to the crim column.
It also means that regardless of what intent is assigned to the column (this value is still important for multi-pipelines), the Preprocessor will always
use this hard-coded pipeline to process that column. The column would still be processed by its initially identified multi-pipeline unless explicitly overridden.

The pipeline itself is defined by the following standard :code:`[[class, name, {param_key: param_value, ...}], ...]`
The pipeline itself is defined by the following standard: :code:`[{"transformer": class, "name": name, "parameters": {param_key: param_value, ...}}, ...]`
When Preprocessor parses this configuration, it will create a Pipeline object with the given transformers of the given class, name, and parameters.
For example, the pipeline above will look something like :code:`sklearn.pipeline.Pipeline([('Scaler', StandardScaler(with_mean=False))])`.
Any class implementing the sklearn Transformer standard (including SmartTransformer) can be used here.
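As a sketch, the resolution step described above can be imitated in a few lines. Everything here is a stand-in: :code:`StandardScaler` is a dummy in place of :code:`sklearn.preprocessing.StandardScaler`, and :code:`build_steps` is a hypothetical simplification of foreshadow's :code:`resolve_pipeline`, which additionally performs dynamic imports and error handling.

```python
# Stand-in for sklearn.preprocessing.StandardScaler (assumption: we only
# care about parameter plumbing here, not actual scaling behavior).
class StandardScaler:
    def __init__(self, with_mean=True):
        self.with_mean = with_mean

# Registry mapping class names in the JSON to transformer classes.
TRANSFORMERS = {"StandardScaler": StandardScaler}

def build_steps(pipeline_json):
    # Each entry {"transformer": cls, "name": name, "parameters": {...}}
    # becomes a (name, instance) step, as in sklearn.pipeline.Pipeline.
    return [
        (step["name"], TRANSFORMERS[step["transformer"]](**step["parameters"]))
        for step in pipeline_json
    ]

steps = build_steps([
    {"transformer": "StandardScaler", "name": "Scaler",
     "parameters": {"with_mean": False}},
])
print(steps[0][0], steps[0][1].with_mean)  # Scaler False
```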
@@ -417,25 +420,26 @@ That pipeline object will be fit on the column crim and will be used to transfor

Moving on to the :code:`"indus"` column defined by the configuration. We can see that it has an intent override but not a pipeline override. This means
that the default :code:`single_pipeline` for the given intent will be used to process that column. By default the serialized pipeline will have
a list of partially matching Intents as a third item in the JSON list following the column name. These can likely be substituted into the Intent name with little or no
a list of partially matching intents under the "all_matched_intents" dict key. These can likely be substituted into the Intent name with little or no
compatibility issues.
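For illustration, a serialized entry for such a column might look like the following. The intent names listed here are hypothetical values, not actual Preprocessor output:

```json
{
  "indus": {
    "intent": "GenericIntent",
    "all_matched_intents": ["GenericIntent", "NumericIntent"]
  }
}
```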

Intent Override
~~~~~~~~~~~~~~~

.. code-block:: json
{"intents":{
"intents":{
"NumericIntent":{
"single":[
["Imputer", "impute", {"strategy":"mean"}]
{"transformer": "Imputer", "name": "impute", "parameters": {"strategy":"mean"}}
],
"multi":[]
}
}}
}
Next, we will examine the :code:`intents` section. This section is used to override intents globally, unlike the columns section which overrode intents on a per-column
basis. Any changes to intents defined in this section will apply across the entire Preprocessor pipeline.
basis. Any changes to intents defined in this section will apply across the entire Preprocessor pipeline. However, individual pipelines defined in the columns section will override pipelines defined here.
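A sketch of how such an override section could be resolved into per-intent pipelines. The helper names here are assumptions for illustration; the actual parsing lives in :code:`Preprocessor._init_json` and uses :code:`resolve_pipeline`:

```python
def build_steps(pipeline_json):
    # Stand-in for foreshadow's resolve_pipeline: keep (name, spec) pairs
    # instead of instantiating real transformer objects.
    return [(step["name"], step) for step in pipeline_json]

def resolve_intents(intents_json):
    # One {"single": ..., "multi": ...} pair of pipelines per overridden intent.
    return {
        intent: {kind: build_steps(spec.get(kind, []))
                 for kind in ("single", "multi")}
        for intent, spec in intents_json.items()
    }

overrides = resolve_intents({
    "NumericIntent": {
        "single": [{"transformer": "Imputer", "name": "impute",
                    "parameters": {"strategy": "mean"}}],
        "multi": [],
    }
})
print(overrides["NumericIntent"]["single"][0][0])  # impute
```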

The keys in this section each represent the name of an intent. In this example, :code:`NumericIntent` is being overridden. The value is a dictionary with the
keys :code:`"single"` and :code:`"multi"`, which represent the single and multi pipeline overrides. The value of these pipelines is parsed through the same mechanism as the pipelines
@@ -452,14 +456,15 @@ Postprocessor Override

.. code-block:: json
{"postprocess":[
["pca",["age"],[
["PCA", "PCA", {"n_components":2}]
]]
{"name":"pca","columns":["age"],"pipeline":[
{"transformer":"PCA", "name":"PCA", "parameters":{"n_components":2}}
]}
]}
Finally, in the :code:`postprocess` section of the configuration, you can manually define pipelines to execute on columns of your choosing. The
content of this section is a list of lists of the form :code:`[[name, [cols, ...], pipeline], ...]`. Each list defines a pipeline that will
content of this section is a list of dictionaries of the form :code:`[{"name":name, "columns":[cols, ...], "pipeline":pipeline}, ...]`. Each dictionary defines a pipeline that will
execute on certain columns. These processes execute after the intent pipelines!

**IMPORTANT** There are two ways of selecting columns through the cols list. By default, specifying a column, or a list of columns, will automatically select
@@ -512,27 +517,29 @@ This is what a combinations section looks like.

.. code-block:: json
{
"columns":{
"crim":["GenericIntent",
[
["StandardScaler", "Scaler", {"with_mean":false}]
]],
"indus":["GenericIntent"]
},
"postprocess":[],
{
"columns":{
"crim":{"intent": "GenericIntent",
"pipeline": [
{"transformer": "StandardScaler", "name": "Scaler", "parameters": {"with_mean":false}}
]},
"indus":{"intent": "GenericIntent"}
},
"intents":{},
"postprocess":[],
"intents":{},
"combinations": [
{
"columns.crim.pipeline.0.parameters.with_mean": "[True, False]",
"columns.crim.pipeline.0.name": "['Scaler', 'SuperScaler']"
}
]
"combinations": [
{
"columns.crim.1.0.2.with_mean": "[True, False]",
"columns.crim.1.0.1": "['Scaler', 'SuperScaler']"
}
]
}
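The dotted keys above are paths into the configuration: each segment selects a dictionary key, and numeric segments index into lists. A minimal sketch of this traversal (mirroring the :code:`_set_path` helper in :code:`foreshadow/optimizers/param_mapping.py`, simplified and without its error handling):

```python
def set_path(config, dotted_key, value):
    # Walk every segment but the last, indexing lists by int and dicts
    # by key, then assign the value at the leaf (always a dict key).
    *parents, leaf = dotted_key.split(".")
    node = config
    for part in parents:
        node = node[int(part)] if isinstance(node, list) else node[part]
    node[leaf] = value

cfg = {"columns": {"crim": {"pipeline": [
    {"transformer": "StandardScaler", "name": "Scaler",
     "parameters": {"with_mean": False}}]}}}
set_path(cfg, "columns.crim.pipeline.0.parameters.with_mean", True)
print(cfg["columns"]["crim"]["pipeline"][0]["parameters"]["with_mean"])  # True
```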
This section of the configuration file is a list of dictionaries. Each dictionary represents a single parameter space definition that should be searched. Within these dictionaries
16 changes: 10 additions & 6 deletions foreshadow/optimizers/param_mapping.py
@@ -148,24 +148,28 @@ def _set_path(key, value, original):
"""
path = key.split(".")
temp = original
curr_key = ""

try:

# Searches in path, indexed by key if dictionary or by index if list
for p in path[:-1]:
curr_key = p
if isinstance(temp, list):
temp = temp[int(p)]
else: # Dictionary
temp = temp[p]

# Set value
if isinstance(temp, list):
temp[int(path[-1])] = value
else: # Dictionary
temp[path[-1]] = value
# Always Dictionary
temp[path[-1]] = value

except KeyError as e:
raise ValueError("Invalid JSON Key")
raise ValueError("Invalid JSON Key {} in {}".format(curr_key, temp))

except ValueError as e:
raise ValueError(
"Attempted to index list {} with value {}".format(temp, curr_key)
)


def _extract_config_params(param):
74 changes: 48 additions & 26 deletions foreshadow/preprocessor.py
@@ -290,18 +290,18 @@ def _init_json(self):
# Iterate columns
for k, v in config["columns"].items():
# Assign custom intent map
self._intent_map[k] = registry_eval(v[0])
self._intent_map[k] = registry_eval(v["intent"])

# Assign custom pipeline map
if len(v) > 1:
self._pipeline_map[k] = resolve_pipeline(v[1])
if "pipeline" in v.keys():
self._pipeline_map[k] = resolve_pipeline(v["pipeline"])

# Resolve postprocess section into a list of pipelines
if "postprocess" in config.keys():
self._multi_column_map = [
[v[PipelineStep["NAME"]], v[1], resolve_pipeline(v[2])]
[v["name"], v["columns"], resolve_pipeline(v["pipeline"])]
for v in config["postprocess"]
if len(v) >= 3
if validate_pipeline(v)
]

# Resolve intents section into a dictionary of intents and pipelines
@@ -311,10 +311,10 @@ def _init_json(self):
for k, v in config["intents"].items()
}

except KeyError as e:
raise ValueError("JSON Configuration is malformed: {}".format(str(e)))
except ValueError as e:
raise e
except Exception as e:
raise ValueError("JSON Configuration is malformed: {}".format(str(e)))

def get_params(self, deep=True):
if self.pipeline is None:
@@ -343,19 +343,20 @@ def serialize(self):
"""
json_cols = {
k: (
self._intent_map[k].__name__,
serialize_pipeline(
k: {
"intent": self._intent_map[k].__name__,
"pipeline": serialize_pipeline(
self._pipeline_map.get(k, Pipeline([("null", None)]))
),
[c[1].__name__ for c in self._choice_map[k]],
)
"all_matched_intents": [c[1].__name__ for c in self._choice_map[k]],
}
for k in self._intent_map.keys()
}

# Serialize multi-column processors
json_multi = [
[v[0], v[1], serialize_pipeline(v[2])] for v in self._multi_column_map
{"name": v[0], "columns": v[1], "pipeline": serialize_pipeline(v[2])}
for v in self._multi_column_map
]

# Serialize intent multi processors
@@ -439,11 +440,11 @@ def serialize_pipeline(pipeline):
list: JSON serializable object of form ``[{"transformer": cls, "name": name, "parameters": {**params}}, ...]``
"""
return [
(
type(step[PipelineStep["CLASS"]]).__name__,
step[PipelineStep["NAME"]],
step[PipelineStep["CLASS"]].get_params(deep=False),
)
{
"transformer": type(step[PipelineStep["CLASS"]]).__name__,
"name": step[PipelineStep["NAME"]],
"parameters": step[PipelineStep["CLASS"]].get_params(deep=False),
}
for step in pipeline.steps
if pipeline.steps[0][PipelineStep["NAME"]] != "null"
]
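The shape of :code:`serialize_pipeline`'s output can be sketched without sklearn. :code:`StandardScaler` below is a dummy stand-in, and :code:`serialize_steps` is an assumed simplification of the function above (no PipelineStep indirection, no null-pipeline filtering):

```python
class StandardScaler:
    # Stand-in exposing the sklearn-style get_params interface.
    def __init__(self, with_mean=True):
        self.with_mean = with_mean
    def get_params(self, deep=False):
        return {"with_mean": self.with_mean}

def serialize_steps(steps):
    # Mirror of serialize_pipeline's output: one dict per (name, obj) step.
    return [
        {"transformer": type(obj).__name__, "name": name,
         "parameters": obj.get_params(deep=False)}
        for name, obj in steps
    ]

out = serialize_steps([("Scaler", StandardScaler(with_mean=False))])
print(out)
# [{'transformer': 'StandardScaler', 'name': 'Scaler', 'parameters': {'with_mean': False}}]
```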
@@ -471,16 +472,19 @@ def resolve_pipeline(pipeline_json):

for trans in pipeline_json:

if len(trans) != 3:
raise ValueError(
try:
clsname = trans["transformer"]
name = trans["name"]
params = trans["parameters"]

except KeyError as e:
raise KeyError(
"Malformed transformer {} correct syntax is"
"[cls, name, {{**params}}]".format(trans)
'{{"transformer": cls, "name": name, "parameters": {{**params}}}}'.format(
trans
)
)

clsname = trans[0]
name = trans[1]
params = trans[2]

try:
search_module = (
module_internals
@@ -497,9 +501,27 @@
except Exception as e:
raise ValueError("Could not import defined transformer {}".format(clsname))

pipe.append((name, cls(**params)))
try:
pipe.append((name, cls(**params)))
except TypeError as e:
raise ValueError(
"Params {} invalid for transformer {}".format(params, cls.__name__)
)

if len(pipe) == 0:
return Pipeline([("null", None)])

return Pipeline(pipe)


def validate_pipeline(v):
"""
Validates that a dictionary contains the correct keys for a pipeline
Args:
v: (dict) Pipeline dictionary
Returns: True if dict is valid pipeline
"""

return "columns" in v.keys() and "pipeline" in v.keys() and "name" in v.keys()
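Usage of this validator is straightforward; in :code:`_init_json` above, a postprocess entry missing any of the three keys is simply filtered out:

```python
def validate_pipeline(v):
    # A postprocess entry must carry all three keys to be resolved.
    return "columns" in v.keys() and "pipeline" in v.keys() and "name" in v.keys()

print(validate_pipeline({"name": "pca", "columns": ["age"], "pipeline": []}))  # True
print(validate_pipeline({"name": "pca", "columns": ["age"]}))  # False
```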
