Merge branch 'development' into issue_11_financial_intent

georgian-io-archive · Jan 19, 2019 · d57da6b
2 parents b065011 + 6579deb
commit d57da6b

Showing 37 changed files with 1,057 additions and 206 deletions.
diff --git a/doc/users.rst b/doc/users.rst
@@ -293,8 +293,7 @@ Preprocessor uses a hierarchical structure defined by the superclass (parent) an
 and intent. There is also a priority order defined in each intent to break ties at the same level.
 
 This tree-like structure which has :py:obj:`GenericIntent <foreshadow.intents.GenericIntent>` as its
-root node is used to prioritize Intents. Intents further down the tree more precisely define a feature, thus the Intent
-farthest from the root node that matches a given feature is assigned to it.
+root node is used to prioritize Intents. Intents further down the tree define a feature more precisely, and Intents further to the right hold a higher priority than those to the left; the Intent represented by the right-most matching node of the tree is therefore selected.
 
 Each Intent contains a :code:`multi-pipeline` and a :code:`single-pipeline`. These objects are lists of tuples of the form
 :code:`[('name', TransformerObject()),...]` and are used by Preprocessor to construct sklearn Pipeline objects.
@@ -351,23 +350,25 @@ An example configuration for processing the Boston Housing dataset is below. We
 
     {
       "columns":{
-        "crim":["GenericIntent",
-                [
-                  ["StandardScaler", "Scaler", {"with_mean":false}]
-                ]],
-        "indus":["GenericIntent"]
+        "crim":{"intent": "GenericIntent",
+                "pipeline": [
+                  {"transformer": "StandardScaler", "name": "Scaler", "parameters": {"with_mean":false}}
+                ]},
+        "indus":{"intent": "GenericIntent"}
       },
 
       "postprocess":[
-        ["pca",["age"],[
-          ["PCA", "PCA", {"n_components":2}]
-        ]]
+        {"name":"pca",
+         "columns": ["age"],
+         "pipeline": [
+          {"transformer": "PCA", "name": "PCA", "parameters": {"n_components":2}}
+        ]}
       ],
 
       "intents":{
         "NumericIntent":{
           "single":[
-            ["Imputer", "impute", {"strategy":"mean"}]
+            {"transformer": "Imputer", "name": "impute", "parameters": {"strategy":"mean"}}
           ],
           "multi":[]
         }
@@ -383,32 +384,33 @@ Column Override
 
 .. code-block:: json
 
-      {"columns":{
-        "crim":["GenericIntent",
-                [
-                  ["StandardScaler", "Scaler", {"with_mean":false}]
-                ]],
-        "indus":["GenericIntent"]
-      }}
+      "columns":{
+        "crim":{"intent": "GenericIntent",
+                "pipeline": [
+                  {"transformer": "StandardScaler", "name": "Scaler", "parameters": {"with_mean":false}}
+                ]},
+        "indus":{"intent": "GenericIntent"}
+      }
 
 This section is a dictionary containing two keys, each of which is a column in the Boston Housing set. First we will look at the value
-of the :code:`"crim"` key which is a list.
+of the :code:`"crim"` key which is a dict.
 
 
 .. code-block:: json
+        
+        {"intent": "GenericIntent",
+                "pipeline": [
+                  {"transformer": "StandardScaler", "name": "Scaler", "parameters": {"with_mean":false}}
+        ]}
 
-    ["GenericIntent",[
-        ["StandardScaler", "Scaler", {"with_mean":false}]
-    ]]
-
-The list is of form :code:`[intent_name, pipeline]`. Here we can see that this column has been assigned the intent :code:`"GenericIntent`
-and the pipeline :code:`[["StandardScaler", "Scaler", {"with_mean":false}]]`
+Here we can see that this column has been assigned the intent :code:`"GenericIntent"`
+and the pipeline :code:`[{"transformer": "StandardScaler", "name": "Scaler", "parameters": {"with_mean":false}}]`.
 
 This means that regardless of how Preprocessor automatically assigns Intents, the intent GenericIntent will always be assigned to the crim column.
 It also means that regardless of what intent is assigned to the column (this value is still important for multi-pipelines), the Preprocessor will always
 use this hard-coded pipeline to process that column. The column would still be processed by its initially identified multi-pipeline unless explicitly overridden.
 
-The pipeline itself is defined by the following standard :code:`[[class, name, {param_key: param_value, ...}], ...]`
+The pipeline itself is defined by the following standard: :code:`[{"transformer":class, "name":name, "parameters":{param_key: param_value, ...}}, ...]`.
 When Preprocessor parses this configuration it will create a Pipeline object with the given transformers of the given class, name, and parameters.
 For example, the pipeline above will look something like :code:`sklearn.pipeline.Pipeline([('Scaler', StandardScaler(with_mean=False))])`.
 Any class implementing the sklearn Transformer standard (including SmartTransformer) can be used here.
@@ -417,25 +419,26 @@ That pipeline object will be fit on the column crim and will be used to transfor
 
 Moving on to the :code:`"indus"` column defined by the configuration. We can see that it has an intent override but not a pipeline override. This means
 that the default :code:`single_pipeline` for the given intent will be used to process that column. By default the serialized pipeline will have
-a list of partially matching Intents as a third item in the JSON list following the column name. These can likely be substituted into the Intent name with little or no
+a list of partially matching intents under the "all_matched_intents" dict key. These can likely be substituted into the Intent name with little or no
 compatibility issues.
 
 Intent Override
 ~~~~~~~~~~~~~~~
 
 .. code-block:: json
 
-    {"intents":{
+      "intents":{
         "NumericIntent":{
           "single":[
-            ["Imputer", "impute", {"strategy":"mean"}]
+            {"transformer": "Imputer", "name": "impute", "parameters": {"strategy":"mean"}}
           ],
           "multi":[]
         }
-    }}
+      }
+
 
 Next, we will examine the :code:`intents` section. This section is used to override intents globally, unlike the columns section which overrode intents on a per-column
-basis. Any changes to intents defined in this section will apply across the entire Preprocessor pipeline.
+basis. Any changes to intents defined in this section will apply across the entire Preprocessor pipeline. However, individual pipelines defined in the columns section will override pipelines defined here.
 
 The keys in this section each represent the name of an intent. In this example, :code:`NumericIntent` is being overridden. The value is a dictionary whose
 keys :code:`"single"` and :code:`"multi"` represent the single and multi pipeline overrides. The value of these pipelines is parsed through the same mechanism as the pipelines
@@ -452,14 +455,15 @@ Postprocessor Override
 
 .. code-block:: json
 
+
     {"postprocess":[
-        ["pca",["age"],[
-            ["PCA", "PCA", {"n_components":2}]
-        ]]
+        {"name":"pca","columns":["age"],"pipeline":[
+            {"transformer":"PCA", "name":"PCA", "parameters":{"n_components":2}}
+        ]}
     ]}
 
 Finally, in the :code:`postprocess` section of the configuration, you can manually define pipelines to execute on columns of your choosing. The
-content of this section is a list of lists of the form :code:`[[name, [cols, ...], pipeline], ...]`. Each list defines a pipeline that will
+content of this section is a list of dictionaries of the form :code:`[{"name":name, "columns":[cols, ...], "pipeline":pipeline}, ...]`. Each dictionary defines a pipeline that will
 execute on certain columns. These processes execute after the intent pipelines!
 
 **IMPORTANT** There are two ways of selecting columns through the cols list. By default, specifying a column, or a list of columns, will automatically select
@@ -512,27 +516,29 @@ This is what a combinations section looks like.
 
 .. code-block:: json
 
-    {
-      "columns":{
-        "crim":["GenericIntent",
-                [
-                  ["StandardScaler", "Scaler", {"with_mean":false}]
-                ]],
-        "indus":["GenericIntent"]
-      },
 
-      "postprocess":[],
+        {
+          "columns":{
+            "crim":{"intent": "GenericIntent",
+                    "pipeline": [
+                            {"transformer": "StandardScaler", "name": "Scaler", "parameters": {"with_mean":false}}
+                    ]},
+            "indus":{"intent": "GenericIntent"}
+          },
 
-      "intents":{},
+          "postprocess":[],
+
+          "intents":{},
+
+          "combinations": [
+            {
+              "columns.crim.pipeline.0.parameters.with_mean": "[True, False]",
+              "columns.crim.pipeline.0.name": "['Scaler', 'SuperScaler']"
+            }
+          ]
 
-      "combinations": [
-        {
-          "columns.crim.1.0.2.with_mean": "[True, False]",
-          "columns.crim.1.0.1": "['Scaler', 'SuperScaler']"
         }
-      ]
 
-    }
 
 
 This section of the configuration file is a list of dictionaries. Each dictionary represents a single parameter space definition that should be searched. Within these dictionaries
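
The documentation hunks above replace the positional list layout of the configuration with keyed dictionaries. As a rough illustration of how the two layouts correspond, here is a minimal sketch of a converter. The names `step_to_dict` and `migrate_config` are hypothetical and not part of foreshadow, and the sketch does not handle the `combinations` keys or the serialized list of partially matching intents.

```python
import json


def step_to_dict(step):
    """Convert one old-style [class, name, params] step to the keyed form."""
    transformer, name, params = step
    return {"transformer": transformer, "name": name, "parameters": params}


def migrate_config(old):
    """Return a new dict-based config built from an old list-based one.

    Hypothetical helper: the field names follow the documentation diff above.
    """
    new = {"columns": {}, "postprocess": [], "intents": {}}

    for col, entry in old.get("columns", {}).items():
        converted = {"intent": entry[0]}
        if len(entry) > 1:  # the pipeline override is optional
            converted["pipeline"] = [step_to_dict(s) for s in entry[1]]
        new["columns"][col] = converted

    for name, cols, pipeline in old.get("postprocess", []):
        new["postprocess"].append(
            {
                "name": name,
                "columns": cols,
                "pipeline": [step_to_dict(s) for s in pipeline],
            }
        )

    for intent, pipes in old.get("intents", {}).items():
        new["intents"][intent] = {
            kind: [step_to_dict(s) for s in steps]
            for kind, steps in pipes.items()
        }

    return new


old_config = {
    "columns": {
        "crim": [
            "GenericIntent",
            [["StandardScaler", "Scaler", {"with_mean": False}]],
        ],
        "indus": ["GenericIntent"],
    },
    "postprocess": [["pca", ["age"], [["PCA", "PCA", {"n_components": 2}]]]],
    "intents": {
        "NumericIntent": {
            "single": [["Imputer", "impute", {"strategy": "mean"}]],
            "multi": [],
        }
    },
}

new_config = migrate_config(old_config)
print(json.dumps(new_config, indent=2))
```

The keyed form is longer, but it lets a dotted path such as `columns.crim.pipeline.0.parameters.with_mean` name each level explicitly instead of relying on list positions.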

diff --git a/foreshadow/intents/base.py b/foreshadow/intents/base.py
@@ -127,7 +127,7 @@ def priority_traverse(cls):
             node = lqueue.pop(0)
             if len(node.children) > 0:
                 node_children = map(registry_eval, node.children)
-                lqueue.extend(node_children)
+                lqueue[0:0] = node_children  # Prepend children so traversal is depth-first
 
     @classmethod
     @check_base
@@ -143,6 +143,20 @@ def is_intent(cls, df):
         """
         pass  # pragma: no cover
 
+    @classmethod
+    @check_base
+    @abstractmethod
+    def column_summary(cls, df):
+        """Computes relevant statistics and returns a JSON dict of those values
+
+        Args:
+            df: pd.DataFrame to summarize
+
+        Returns:
+            A JSON dict of relevant statistics
+        """
+        pass  # pragma: no cover
+
     @classmethod
     def _check_intent(cls):
         """Validate class variables are setup properly"""

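The one-line change in `priority_traverse` switches the queue from appending children (breadth-first order) to prepending them (depth-first, pre-order). The sketch below shows the difference on a toy tree. The `traverse` function and the `TREE` mapping are illustrative assumptions, and `FinancialIntent` as a child of `NumericIntent` is hypothetical; foreshadow's real traversal works on registered intent classes rather than on a dict.

```python
def traverse(tree, root, depth_first=True):
    """Yield node names starting at root.

    tree maps a node name to an ordered list of its child names.
    """
    queue = [root]
    while queue:
        node = queue.pop(0)
        yield node
        children = tree.get(node, [])
        if depth_first:
            # Prepend: a node's children are visited before its siblings.
            queue[0:0] = children
        else:
            # Append: all siblings are visited before any children.
            queue.extend(children)


# Hypothetical hierarchy used only for this illustration.
TREE = {
    "GenericIntent": ["NumericIntent", "CategoricalIntent"],
    "NumericIntent": ["FinancialIntent"],
}

dfs_order = list(traverse(TREE, "GenericIntent"))
bfs_order = list(traverse(TREE, "GenericIntent", depth_first=False))

print(dfs_order)
print(bfs_order)
```

With prepending, `FinancialIntent` is visited immediately after its parent; with appending, it is visited only after every node at the parent's level.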
diff --git a/foreshadow/intents/general.py b/foreshadow/intents/general.py
@@ -1,6 +1,8 @@
 """
 General intents definitions
 """
+import json
+from collections import OrderedDict
 
 import pandas as pd
 import numpy as np
@@ -11,6 +13,39 @@
 from ..transformers.smart import SimpleImputer, MultiImputer, Scaler, Encoder
 
 
+def _mode_freq(s, count=10):
+    """Computes the mode and the most frequent values
+
+        Args:
+            s (pandas.Series): the series to analyze
+            count (int): the number of most frequent values to return
+
+        Returns:
+            A tuple with the list of modes and a list of rows for the `count`
+            most common values (value, frequency count, fraction of all rows)
+    """
+    mode = s.mode().values.tolist()
+    vc = s.value_counts().nlargest(count).reset_index()
+    vc["PCT"] = vc.iloc[:, -1] / s.size
+    return (mode, vc.values.tolist())
+
+
+def _outliers(s, count=10):
+    """Computes the largest outliers (more than 3 std from the mean)
+
+        Args:
+            s (pandas.Series): the series to analyze
+            count (int): the n largest (magnitude) outliers
+
+        Returns a pandas.Series of outliers
+    """
+    out_ser = s[np.abs(s - s.mean()) > (3 * s.std())]
+    out_df = out_ser.to_frame()
+    out_df["selector"] = out_ser.abs()
+
+    return out_df.loc[out_df["selector"].nlargest(count).index].iloc[:, 0]
+
+
 class GenericIntent(BaseIntent):
     """See base class.
 
@@ -35,6 +70,11 @@ def is_intent(cls, df):
         """Returns true by default such that a column must match this"""
         return True
 
+    @classmethod
+    def column_summary(cls, df):
+        """No statistics can be computed for a general column"""
+        return {}
+
 
 class NumericIntent(GenericIntent):
     """See base class.
@@ -66,6 +106,52 @@ def is_intent(cls, df):
             .all()
         )
 
+    @classmethod
+    def column_summary(cls, df):
+        """Returns computed statistics for a NumericIntent column
+
+            The following are computed:
+                nan: count of NaNs passed in the dataset
+                invalid: number of invalid values after converting to numeric
+                mean: -
+                std: -
+                min: -
+                25th: 25th percentile
+                median: -
+                75th: 75th percentile
+                max: -
+                mode: mode or np.nan if data is mostly unique
+                top10: top 10 most frequent values or empty array if mostly unique
+                    [(value, count),...,]
+                10outliers: largest 10 outliers
+
+        """
+
+        data = df.iloc[:, 0]
+        nan_num = int(data.isnull().sum())
+        invalid_num = int(
+            pd.to_numeric(data, errors="coerce").isnull().sum() - nan_num
+        )
+        outliers = _outliers(data).values.tolist()
+        mode, top10 = _mode_freq(data)
+
+        return OrderedDict(
+            [
+                ("nan", nan_num),
+                ("invalid", invalid_num),
+                ("mean", data.mean()),
+                ("std", data.std()),
+                ("min", data.min()),
+                ("25th", data.quantile(0.25)),
+                ("median", data.quantile()),
+                ("75th", data.quantile(0.75)),
+                ("max", data.max()),
+                ("mode", mode),
+                ("top10", top10),
+                ("10outliers", outliers),
+            ]
+        )
+
 
 class CategoricalIntent(GenericIntent):
     """See base class.
@@ -94,3 +180,19 @@ def is_intent(cls, df):
             return True
         else:
             return (1.0 * data.nunique() / data.count()) < 0.2
+
+    @classmethod
+    def column_summary(cls, df):
+        """Returns computed statistics for a CategoricalIntent column
+
+            The following are computed:
+                nan: count of NaNs passed in the dataset
+                mode: mode or np.nan if data is mostly unique
+                top10: top 10 most frequent values or empty array if mostly unique
+                    [(value, count),...,]
+        """
+        data = df.iloc[:, 0]
+        nan_num = int(data.isnull().sum())
+        mode, top10 = _mode_freq(data)
+
+        return OrderedDict([("nan", nan_num), ("mode", mode), ("top10", top10)])
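
To make the statistics behind `column_summary` concrete, here is a dependency-free analogue of the two helpers added in this file. It is a sketch using only the standard library, not the pandas code from the diff: the names `mode_freq` and `outliers` are hypothetical, NaN handling is not mirrored, and the sample standard deviation from `statistics.stdev` is assumed to match pandas' default.

```python
from collections import Counter
from statistics import mean, stdev


def mode_freq(values, count=10):
    """Return (modes, rows); each row is [value, count, fraction of all rows]."""
    counts = Counter(values)
    top = max(counts.values())
    # Every value tied for the highest count is a mode.
    modes = sorted(v for v, c in counts.items() if c == top)
    rows = [[v, c, c / len(values)] for v, c in counts.most_common(count)]
    return modes, rows


def outliers(values, count=10):
    """Return up to `count` values lying more than 3 sample standard
    deviations from the mean, largest absolute value first."""
    mu, sigma = mean(values), stdev(values)
    flagged = [v for v in values if abs(v - mu) > 3 * sigma]
    return sorted(flagged, key=abs, reverse=True)[:count]


data = [10] * 20 + [1000]
modes, rows = mode_freq(data, count=2)

print(modes)
print(rows)
print(outliers(data))
```

As in the diff, the outlier test uses the distance from the mean, while the final ranking uses the absolute value of the outliers themselves.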
diff --git a/foreshadow/optimizers/param_mapping.py b/foreshadow/optimizers/param_mapping.py
@@ -148,24 +148,28 @@ def _set_path(key, value, original):
     """
     path = key.split(".")
     temp = original
+    curr_key = ""
 
     try:
 
         # Searches in path, indexed by key if dictionary or by index if list
         for p in path[:-1]:
+            curr_key = p
             if isinstance(temp, list):
                 temp = temp[int(p)]
             else:  # Dictionary
                 temp = temp[p]
 
-        # Set value
-        if isinstance(temp, list):
-            temp[int(path[-1])] = value
-        else:  # Dictionary
-            temp[path[-1]] = value
+        # Always Dictionary
+        temp[path[-1]] = value
 
     except KeyError as e:
-        raise ValueError("Invalid JSON Key")
+        raise ValueError("Invalid JSON Key {} in {}".format(curr_key, temp))
+
+    except ValueError as e:
+        raise ValueError(
+            "Attempted to index list {} with value {}".format(temp, curr_key)
+        )
 
 
 def _extract_config_params(param):
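
The `_set_path` hunk above resolves a dotted key such as those used in the `combinations` section. Below is a standalone sketch of that logic, reconstructed from the lines visible in the diff. The name `set_path` and the sample `config` are assumptions for illustration, and the real function may differ outside the shown hunk.

```python
def set_path(key, value, original):
    """Set a value in a nested dict/list structure in place, by dotted key."""
    path = key.split(".")
    temp = original
    curr_key = ""

    try:
        # Walk every segment except the last one.
        for p in path[:-1]:
            curr_key = p
            if isinstance(temp, list):
                temp = temp[int(p)]  # list segments must be integers
            else:
                temp = temp[p]

        # The final container is expected to be a dictionary.
        temp[path[-1]] = value

    except KeyError:
        raise ValueError("Invalid JSON Key {} in {}".format(curr_key, temp))

    except ValueError:
        raise ValueError(
            "Attempted to index list {} with value {}".format(temp, curr_key)
        )


config = {
    "columns": {
        "crim": {
            "intent": "GenericIntent",
            "pipeline": [
                {
                    "transformer": "StandardScaler",
                    "name": "Scaler",
                    "parameters": {"with_mean": False},
                }
            ],
        }
    }
}

set_path("columns.crim.pipeline.0.parameters.with_mean", True, config)
print(config["columns"]["crim"]["pipeline"][0]["parameters"])

# Both malformed keys surface as ValueError with a descriptive message.
errors = []
for bad_key in ("columns.missing.intent", "columns.crim.pipeline.first.name"):
    try:
        set_path(bad_key, "x", config)
    except ValueError as exc:
        errors.append(str(exc))
print(errors)
```

An out-of-range list index would raise `IndexError`, which neither handler catches; whether that case should also be translated is left open here.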