This repository has been archived by the owner on Jan 9, 2024. It is now read-only.

Commit

Merge pull request #22 from georgianpartners/issue_10
Implemented configuration change resolving issue #10
alexrallen committed Jan 4, 2019
2 parents 33eee1c + c3e80a6 commit be7d368
Showing 19 changed files with 287 additions and 161 deletions.
105 changes: 56 additions & 49 deletions doc/users.rst
@@ -351,23 +351,25 @@ An example configuration for processing the Boston Housing dataset is below. We
{
"columns":{
"crim":["GenericIntent",
[
["StandardScaler", "Scaler", {"with_mean":false}]
]],
"indus":["GenericIntent"]
"crim":{"intent": "GenericIntent",
"pipeline": [
{"transformer": "StandardScaler", "name": "Scaler", "parameters": {"with_mean":false}}
]},
"indus":{"intent": "GenericIntent"}
},
"postprocess":[
["pca",["age"],[
["PCA", "PCA", {"n_components":2}]
]]
{"name":"pca",
"columns": ["age"],
"pipeline": [
{"transformer": "PCA", "name": "PCA", "parameters": {"n_components":2}}
]}
],
"intents":{
"NumericIntent":{
"single":[
["Imputer", "impute", {"strategy":"mean"}]
{"transformer": "Imputer", "name": "impute", "parameters": {"strategy":"mean"}}
],
"multi":[]
}
@@ -383,32 +385,33 @@ Column Override

.. code-block:: json
{"columns":{
"crim":["GenericIntent",
[
["StandardScaler", "Scaler", {"with_mean":false}]
]],
"indus":["GenericIntent"]
}}
"columns":{
"crim":{"intent": "GenericIntent",
"pipeline": [
{"transformer": "StandardScaler", "name": "Scaler", "parameters": {"with_mean":false}}
]},
"indus":{"intent": "GenericIntent"}
}
This section is a dictionary containing two keys, each of which is a column in the Boston Housing set. First we will look at the value
of the :code:`"crim"` key which is a list.
of the :code:`"crim"` key which is a dict.


.. code-block:: json
{"intent": "GenericIntent",
"pipeline": [
{"transformer": "StandardScaler", "name": "Scaler", "parameters": {"with_mean":false}}
]}
["GenericIntent",[
["StandardScaler", "Scaler", {"with_mean":false}]
]]
The list is of form :code:`[intent_name, pipeline]`. Here we can see that this column has been assigned the intent :code:`"GenericIntent"`
and the pipeline :code:`[["StandardScaler", "Scaler", {"with_mean":false}]]`.
Here we can see that this column has been assigned the intent :code:`"GenericIntent"`
and the pipeline :code:`[{"transformer": "StandardScaler", "name": "Scaler", "parameters": {"with_mean":false}}]`.

This means that regardless of how Preprocessor automatically assigns Intents, the intent GenericIntent will always be assigned to the crim column.
It also means that regardless of what intent is assigned to the column (this value is still important for multi-pipelines), the Preprocessor will always
use this hard-coded pipeline to process that column. The column would still be processed by its initially identified multi-pipeline unless explicitly overridden.

The pipeline itself is defined by the following standard :code:`[[class, name, {param_key: param_value, ...}], ...]`
The pipeline itself is defined by the following standard: :code:`[{"transformer": class, "name": name, "parameters": {param_key: param_value, ...}}, ...]`
When Preprocessor parses this configuration, it will create a Pipeline object with the given transformers of the given class, name, and parameters.
For example, the pipeline above will look something like :code:`sklearn.pipeline.Pipeline([('Scaler', StandardScaler(with_mean=False))])`.
Any class implementing the sklearn Transformer standard (including SmartTransformer) can be used here.
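As a sketch, the resolution step described above can be imitated in a few lines. Everything here is a stand-in: :code:`StandardScaler` is a dummy in place of :code:`sklearn.preprocessing.StandardScaler`, and :code:`build_steps` is a hypothetical simplification of foreshadow's :code:`resolve_pipeline`, which additionally performs dynamic imports and error handling.

```python
# Stand-in for sklearn.preprocessing.StandardScaler (assumption: we only
# care about parameter plumbing here, not actual scaling behavior).
class StandardScaler:
    def __init__(self, with_mean=True):
        self.with_mean = with_mean

# Registry mapping class names in the JSON to transformer classes.
TRANSFORMERS = {"StandardScaler": StandardScaler}

def build_steps(pipeline_json):
    # Each entry {"transformer": cls, "name": name, "parameters": {...}}
    # becomes a (name, instance) step, as in sklearn.pipeline.Pipeline.
    return [
        (step["name"], TRANSFORMERS[step["transformer"]](**step["parameters"]))
        for step in pipeline_json
    ]

steps = build_steps([
    {"transformer": "StandardScaler", "name": "Scaler",
     "parameters": {"with_mean": False}},
])
print(steps[0][0], steps[0][1].with_mean)  # Scaler False
```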
@@ -417,25 +420,26 @@ That pipeline object will be fit on the column crim and will be used to transfor

Moving on to the :code:`"indus"` column defined by the configuration. We can see that it has an intent override but not a pipeline override. This means
that the default :code:`single_pipeline` for the given intent will be used to process that column. By default the serialized pipeline will have
a list of partially matching Intents as a third item in the JSON list following the column name. These can likely be substituted into the Intent name with little or no
a list of partially matching intents under the "all_matched_intents" dict key. These can likely be substituted into the Intent name with little or no
compatibility issues.
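For illustration, a serialized entry for such a column might look like the following. The intent names listed here are hypothetical values, not actual Preprocessor output:

```json
{
  "indus": {
    "intent": "GenericIntent",
    "all_matched_intents": ["GenericIntent", "NumericIntent"]
  }
}
```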

Intent Override
~~~~~~~~~~~~~~~

.. code-block:: json
{"intents":{
"intents":{
"NumericIntent":{
"single":[
["Imputer", "impute", {"strategy":"mean"}]
{"transformer": "Imputer", "name": "impute", "parameters": {"strategy":"mean"}}
],
"multi":[]
}
}}
}
Next, we will examine the :code:`intents` section. This section is used to override intents globally, unlike the columns section which overrode intents on a per-column
basis. Any changes to intents defined in this section will apply across the entire Preprocessor pipeline.
basis. Any changes to intents defined in this section will apply across the entire Preprocessor pipeline. However, individual pipelines defined in the columns section will override pipelines defined here.
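A sketch of how such an override section could be resolved into per-intent pipelines. The helper names here are assumptions for illustration; the actual parsing lives in :code:`Preprocessor._init_json` and uses :code:`resolve_pipeline`:

```python
def build_steps(pipeline_json):
    # Stand-in for foreshadow's resolve_pipeline: keep (name, spec) pairs
    # instead of instantiating real transformer objects.
    return [(step["name"], step) for step in pipeline_json]

def resolve_intents(intents_json):
    # One {"single": ..., "multi": ...} pair of pipelines per overridden intent.
    return {
        intent: {kind: build_steps(spec.get(kind, []))
                 for kind in ("single", "multi")}
        for intent, spec in intents_json.items()
    }

overrides = resolve_intents({
    "NumericIntent": {
        "single": [{"transformer": "Imputer", "name": "impute",
                    "parameters": {"strategy": "mean"}}],
        "multi": [],
    }
})
print(overrides["NumericIntent"]["single"][0][0])  # impute
```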

The keys in this section each represent the name of an intent. In this example, :code:`NumericIntent` is being overridden. The value is a dictionary with the
keys :code:`"single"` and :code:`"multi"`, which represent the single and multi pipeline overrides. The value of these pipelines is parsed through the same mechanism as the pipelines
@@ -452,14 +456,15 @@ Postprocessor Override

.. code-block:: json
{"postprocess":[
["pca",["age"],[
["PCA", "PCA", {"n_components":2}]
]]
{"name":"pca","columns":["age"],"pipeline":[
{"transformer":"PCA", "name":"PCA", "parameters":{"n_components":2}}
]}
]}
Finally, in the :code:`postprocess` section of the configuration, you can manually define pipelines to execute on columns of your choosing. The
content of this section is a list of lists of the form :code:`[[name, [cols, ...], pipeline], ...]`. Each list defines a pipeline that will
content of this section is a list of dictionaries of the form :code:`[{"name":name, "columns":[cols, ...], "pipeline":pipeline}, ...]`. Each dictionary defines a pipeline that will
execute on certain columns. These processes execute after the intent pipelines!

**IMPORTANT** There are two ways of selecting columns through the cols list. By default, specifying a column, or a list of columns, will automatically select
@@ -512,27 +517,29 @@ This is what a combinations section looks like.

.. code-block:: json
{
"columns":{
"crim":["GenericIntent",
[
["StandardScaler", "Scaler", {"with_mean":false}]
]],
"indus":["GenericIntent"]
},
"postprocess":[],
{
"columns":{
"crim":{"intent": "GenericIntent",
"pipeline": [
{"transformer": "StandardScaler", "name": "Scaler", "parameters": {"with_mean":false}}
]},
"indus":{"intent": "GenericIntent"}
},
"intents":{},
"postprocess":[],
"intents":{},
"combinations": [
{
"columns.crim.pipeline.0.parameters.with_mean": "[True, False]",
"columns.crim.pipeline.0.name": "['Scaler', 'SuperScaler']"
}
]
"combinations": [
{
"columns.crim.1.0.2.with_mean": "[True, False]",
"columns.crim.1.0.1": "['Scaler', 'SuperScaler']"
}
]
}
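The dotted keys above are paths into the configuration: each segment selects a dictionary key, and numeric segments index into lists. A minimal sketch of this traversal (mirroring the :code:`_set_path` helper in :code:`foreshadow/optimizers/param_mapping.py`, simplified and without its error handling):

```python
def set_path(config, dotted_key, value):
    # Walk every segment but the last, indexing lists by int and dicts
    # by key, then assign the value at the leaf (always a dict key).
    *parents, leaf = dotted_key.split(".")
    node = config
    for part in parents:
        node = node[int(part)] if isinstance(node, list) else node[part]
    node[leaf] = value

cfg = {"columns": {"crim": {"pipeline": [
    {"transformer": "StandardScaler", "name": "Scaler",
     "parameters": {"with_mean": False}}]}}}
set_path(cfg, "columns.crim.pipeline.0.parameters.with_mean", True)
print(cfg["columns"]["crim"]["pipeline"][0]["parameters"]["with_mean"])  # True
```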
This section of the configuration file is a list of dictionaries. Each dictionary represents a single parameter space definition that should be searched. Within these dictionaries
16 changes: 10 additions & 6 deletions foreshadow/optimizers/param_mapping.py
@@ -148,24 +148,28 @@ def _set_path(key, value, original):
"""
path = key.split(".")
temp = original
curr_key = ""

try:

# Searches in path, indexed by key if dictionary or by index if list
for p in path[:-1]:
curr_key = p
if isinstance(temp, list):
temp = temp[int(p)]
else: # Dictionary
temp = temp[p]

# Set value
if isinstance(temp, list):
temp[int(path[-1])] = value
else: # Dictionary
temp[path[-1]] = value
# Always Dictionary
temp[path[-1]] = value

except KeyError as e:
raise ValueError("Invalid JSON Key")
raise ValueError("Invalid JSON Key {} in {}".format(curr_key, temp))

except ValueError as e:
raise ValueError(
"Attempted to index list {} with value {}".format(temp, curr_key)
)


def _extract_config_params(param):
74 changes: 48 additions & 26 deletions foreshadow/preprocessor.py
@@ -290,18 +290,18 @@ def _init_json(self):
# Iterate columns
for k, v in config["columns"].items():
# Assign custom intent map
self._intent_map[k] = registry_eval(v[0])
self._intent_map[k] = registry_eval(v["intent"])

# Assign custom pipeline map
if len(v) > 1:
self._pipeline_map[k] = resolve_pipeline(v[1])
if "pipeline" in v.keys():
self._pipeline_map[k] = resolve_pipeline(v["pipeline"])

# Resolve postprocess section into a list of pipelines
if "postprocess" in config.keys():
self._multi_column_map = [
[v[PipelineStep["NAME"]], v[1], resolve_pipeline(v[2])]
[v["name"], v["columns"], resolve_pipeline(v["pipeline"])]
for v in config["postprocess"]
if len(v) >= 3
if validate_pipeline(v)
]

# Resolve intents section into a dictionary of intents and pipelines
@@ -311,10 +311,10 @@ def _init_json(self):
for k, v in config["intents"].items()
}

except KeyError as e:
raise ValueError("JSON Configuration is malformed: {}".format(str(e)))
except ValueError as e:
raise e
except Exception as e:
raise ValueError("JSON Configuration is malformed: {}".format(str(e)))

def get_params(self, deep=True):
if self.pipeline is None:
@@ -343,19 +343,20 @@ def serialize(self):
"""
json_cols = {
k: (
self._intent_map[k].__name__,
serialize_pipeline(
k: {
"intent": self._intent_map[k].__name__,
"pipeline": serialize_pipeline(
self._pipeline_map.get(k, Pipeline([("null", None)]))
),
[c[1].__name__ for c in self._choice_map[k]],
)
"all_matched_intents": [c[1].__name__ for c in self._choice_map[k]],
}
for k in self._intent_map.keys()
}

# Serialize multi-column processors
json_multi = [
[v[0], v[1], serialize_pipeline(v[2])] for v in self._multi_column_map
{"name": v[0], "columns": v[1], "pipeline": serialize_pipeline(v[2])}
for v in self._multi_column_map
]

# Serialize intent multi processors
@@ -439,11 +440,11 @@ def serialize_pipeline(pipeline):
list: JSON serializable object of form ``[{"transformer": cls, "name": name, "parameters": {**params}}, ...]``
"""
return [
(
type(step[PipelineStep["CLASS"]]).__name__,
step[PipelineStep["NAME"]],
step[PipelineStep["CLASS"]].get_params(deep=False),
)
{
"transformer": type(step[PipelineStep["CLASS"]]).__name__,
"name": step[PipelineStep["NAME"]],
"parameters": step[PipelineStep["CLASS"]].get_params(deep=False),
}
for step in pipeline.steps
if pipeline.steps[0][PipelineStep["NAME"]] != "null"
]
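The shape of :code:`serialize_pipeline`'s output can be sketched without sklearn. :code:`StandardScaler` below is a dummy stand-in, and :code:`serialize_steps` is an assumed simplification of the function above (no PipelineStep indirection, no null-pipeline filtering):

```python
class StandardScaler:
    # Stand-in exposing the sklearn-style get_params interface.
    def __init__(self, with_mean=True):
        self.with_mean = with_mean
    def get_params(self, deep=False):
        return {"with_mean": self.with_mean}

def serialize_steps(steps):
    # Mirror of serialize_pipeline's output: one dict per (name, obj) step.
    return [
        {"transformer": type(obj).__name__, "name": name,
         "parameters": obj.get_params(deep=False)}
        for name, obj in steps
    ]

out = serialize_steps([("Scaler", StandardScaler(with_mean=False))])
print(out)
# [{'transformer': 'StandardScaler', 'name': 'Scaler', 'parameters': {'with_mean': False}}]
```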
@@ -471,16 +472,19 @@ def resolve_pipeline(pipeline_json):

for trans in pipeline_json:

if len(trans) != 3:
raise ValueError(
try:
clsname = trans["transformer"]
name = trans["name"]
params = trans["parameters"]

except KeyError as e:
raise KeyError(
"Malformed transformer {} correct syntax is"
"[cls, name, {{**params}}]".format(trans)
'{{"transformer": cls, "name": name, "parameters": {{**params}}}}'.format(
trans
)
)

clsname = trans[0]
name = trans[1]
params = trans[2]

try:
search_module = (
module_internals
@@ -497,9 +501,27 @@
except Exception as e:
raise ValueError("Could not import defined transformer {}".format(clsname))

pipe.append((name, cls(**params)))
try:
pipe.append((name, cls(**params)))
except TypeError as e:
raise ValueError(
"Params {} invalid for transformer {}".format(params, cls.__name__)
)

if len(pipe) == 0:
return Pipeline([("null", None)])

return Pipeline(pipe)


def validate_pipeline(v):
"""
Validates that a dictionary contains the correct keys for a pipeline
Args:
v: (dict) Pipeline dictionary
Returns: True if dict is valid pipeline
"""

return "columns" in v.keys() and "pipeline" in v.keys() and "name" in v.keys()
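Usage of this validator is straightforward; in :code:`_init_json` above, a postprocess entry missing any of the three keys is simply filtered out:

```python
def validate_pipeline(v):
    # A postprocess entry must carry all three keys to be resolved.
    return "columns" in v.keys() and "pipeline" in v.keys() and "name" in v.keys()

print(validate_pipeline({"name": "pca", "columns": ["age"], "pipeline": []}))  # True
print(validate_pipeline({"name": "pca", "columns": ["age"]}))  # False
```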
