This repository has been archived by the owner on Jan 9, 2024. It is now read-only.

Merge branch 'development' into issue_11_financial_intent
adithyabsk committed Jan 19, 2019
2 parents b065011 + 6579deb commit d57da6b
Showing 37 changed files with 1,057 additions and 206 deletions.
108 changes: 57 additions & 51 deletions doc/users.rst
Expand Up @@ -293,8 +293,7 @@ Preprocessor uses a hierarchical structure defined by the superclass (parent) an
and intent. There is also a priority order defined in each intent to break ties at the same level.

This tree-like structure which has :py:obj:`GenericIntent <foreshadow.intents.GenericIntent>` as its
root node is used to prioritize Intents. Intents further down the tree more precisely define a feature, thus the Intent
farthest from the root node that matches a given feature is assigned to it.
root node is used to prioritize Intents. Intents further down the tree define a feature more precisely, and Intents further to the right hold a higher priority than those to their left. The Intent represented by the right-most matching node of the tree is therefore selected.
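
The selection rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the :code:`Node` class, the :code:`matches` flag, and :code:`select_intent` are hypothetical stand-ins rather than Foreshadow's API, and the sketch assumes that the last matching node in a left-to-right depth-first traversal is the one that wins.

```python
class Node:
    """Hypothetical stand-in for an Intent class in the tree."""

    def __init__(self, name, matches, children=()):
        self.name = name
        self.matches = matches  # assumed result of an is_intent-style check
        self.children = list(children)


def priority_traverse(root):
    """Yield nodes depth-first, visiting children left to right."""
    queue = [root]
    while queue:
        node = queue.pop(0)
        yield node
        # Prepending the children (instead of appending) makes this depth-first.
        queue[0:0] = node.children


def select_intent(root):
    """Return the name of the last matching node in traversal order."""
    selected = None
    for node in priority_traverse(root):
        if node.matches:
            selected = node
    return selected.name if selected else None


# Hypothetical tree: the root always matches, CategoricalIntent does not.
tree = Node("GenericIntent", True, [
    Node("NumericIntent", True, [Node("FinancialIntent", True)]),
    Node("CategoricalIntent", False),
])
order = [node.name for node in priority_traverse(tree)]
chosen = select_intent(tree)
```

With this tree the deepest match on the left branch is chosen because the node to its right does not match; if it did match, it would take priority.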

Each Intent contains a :code:`multi-pipeline` and a :code:`single-pipeline`. These objects are lists of tuples of the form
:code:`[('name', TransformerObject()),...]` and are used by Preprocessor to construct sklearn Pipeline objects.
Expand Down Expand Up @@ -351,23 +350,25 @@ An example configuration for processing the Boston Housing dataset is below. We
{
"columns":{
"crim":["GenericIntent",
[
["StandardScaler", "Scaler", {"with_mean":false}]
]],
"indus":["GenericIntent"]
"crim":{"intent": "GenericIntent",
"pipeline": [
{"transformer": "StandardScaler", "name": "Scaler", "parameters": {"with_mean":false}}
]},
"indus":{"intent": "GenericIntent"}
},
"postprocess":[
["pca",["age"],[
["PCA", "PCA", {"n_components":2}]
]]
{"name":"pca",
"columns": ["age"],
"pipeline": [
{"transformer": "PCA", "name": "PCA", "parameters": {"n_components":2}}
]}
],
"intents":{
"NumericIntent":{
"single":[
["Imputer", "impute", {"strategy":"mean"}]
{"transformer": "Imputer", "name": "impute", "parameters": {"strategy":"mean"}}
],
"multi":[]
}
Expand All @@ -383,32 +384,33 @@ Column Override

.. code-block:: json
{"columns":{
"crim":["GenericIntent",
[
["StandardScaler", "Scaler", {"with_mean":false}]
]],
"indus":["GenericIntent"]
}}
"columns":{
"crim":{"intent": "GenericIntent",
"pipeline": [
{"transformer": "StandardScaler", "name": "Scaler", "parameters": {"with_mean":false}}
]},
"indus":{"intent": "GenericIntent"}
}
This section is a dictionary containing two keys, each of which is a column in the Boston Housing set. First we will look at the value
of the :code:`"crim"` key which is a list.
of the :code:`"crim"` key which is a dict.


.. code-block:: json
{"intent": "GenericIntent",
"pipeline": [
{"transformer": "StandardScaler", "name": "Scaler", "parameters": {"with_mean":false}}
]}
["GenericIntent",[
["StandardScaler", "Scaler", {"with_mean":false}]
]]
The list is of form :code:`[intent_name, pipeline]`. Here we can see that this column has been assigned the intent :code:`"GenericIntent`
and the pipeline :code:`[["StandardScaler", "Scaler", {"with_mean":false}]]`
Here we can see that this column has been assigned the intent :code:`"GenericIntent"`
and the pipeline :code:`[{"transformer": "StandardScaler", "name": "Scaler", "parameters": {"with_mean":false}}]`.

This means that regardless of how Preprocessor automatically assigns Intents, the intent GenericIntent will always be assigned to the crim column.
It also means that regardless of what intent is assigned to the column (this value is still important for multi-pipelines), the Preprocessor will always
use this hard-coded pipeline to process that column. The column would still be processed by its initially identified multi-pipeline unless explicitly overridden.

The pipeline itself is defined by the following standard :code:`[[class, name, {param_key: param_value, ...}], ...]`
The pipeline itself is defined by the following standard :code:`[{"transformer":class, "name":name, "parameters":{param_key: param_value, ...}}, ...]`.
When Preprocessor parses this configuration it will create a Pipeline object with the given transformers of the given class, name, and parameters.
For example, the pipeline above will look something like :code:`sklearn.pipeline.Pipeline([('Scaler', StandardScaler(with_mean=False))])`.
Any class implementing the sklearn Transformer standard (including SmartTransformer) can be used here.
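
To make the mapping from the serialized form to pipeline steps concrete, here is a minimal sketch. It is not Foreshadow's parser: :code:`REGISTRY` and :code:`build_steps` are hypothetical names, and the :code:`StandardScaler` class below is a dependency-free stand-in for the sklearn transformer so that the example runs on its own.

```python
class StandardScaler:
    """Stand-in for sklearn.preprocessing.StandardScaler (illustration only)."""

    def __init__(self, with_mean=True):
        self.with_mean = with_mean


# Hypothetical lookup from the "transformer" string to a class.
REGISTRY = {"StandardScaler": StandardScaler}


def build_steps(serialized):
    """Turn [{"transformer": ..., "name": ..., "parameters": {...}}, ...]
    into the [(name, transformer_object), ...] list that a Pipeline expects."""
    steps = []
    for entry in serialized:
        cls = REGISTRY[entry["transformer"]]
        params = entry.get("parameters", {})
        steps.append((entry["name"], cls(**params)))
    return steps


steps = build_steps([
    {"transformer": "StandardScaler", "name": "Scaler",
     "parameters": {"with_mean": False}},
])
```

The resulting list of :code:`(name, object)` tuples has the shape that would be passed to :code:`sklearn.pipeline.Pipeline`.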
Expand All @@ -417,25 +419,26 @@ That pipeline object will be fit on the column crim and will be used to transfor

Moving on to the :code:`"indus"` column defined by the configuration. We can see that it has an intent override but not a pipeline override. This means
that the default :code:`single_pipeline` for the given intent will be used to process that column. By default the serialized pipeline will have
a list of partially matching Intents as a third item in the JSON list following the column name. These can likely be substituted into the Intent name with little or no
a list of partially matching intents under the :code:`"all_matched_intents"` dict key. These can likely be substituted into the Intent name with few or no
compatibility issues.

Intent Override
~~~~~~~~~~~~~~~

.. code-block:: json
{"intents":{
"intents":{
"NumericIntent":{
"single":[
["Imputer", "impute", {"strategy":"mean"}]
{"transformer": "Imputer", "name": "impute", "parameters": {"strategy":"mean"}}
],
"multi":[]
}
}}
}
Next, we will examine the :code:`intents` section. This section is used to override intents globally, unlike the columns section which overrode intents on a per-column
basis. Any changes to intents defined in this section will apply across the entire Preprocessor pipeline.
basis. Any changes to intents defined in this section will apply across the entire Preprocessor pipeline. However, individual pipelines defined in the columns section will override pipelines defined here.
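
The precedence between the two sections can be sketched as follows. This is an assumption-laden illustration, not the library's resolution logic: :code:`resolve_single_pipeline` and :code:`detected_intent` are hypothetical names, and the sketch only covers the single pipeline.

```python
def resolve_single_pipeline(config, column, detected_intent):
    """Pick the serialized single pipeline for a column.

    Assumed precedence: a column-level "pipeline" wins, then a global
    override for the column's intent, otherwise None (meaning the intent's
    built-in default would be used).
    """
    column_cfg = config.get("columns", {}).get(column, {})
    if "pipeline" in column_cfg:
        return column_cfg["pipeline"]
    intent = column_cfg.get("intent", detected_intent)
    return config.get("intents", {}).get(intent, {}).get("single")


config = {
    "columns": {
        "crim": {"intent": "GenericIntent", "pipeline": [
            {"transformer": "StandardScaler", "name": "Scaler",
             "parameters": {"with_mean": False}},
        ]},
        "indus": {"intent": "GenericIntent"},
    },
    "intents": {
        "NumericIntent": {
            "single": [{"transformer": "Imputer", "name": "impute",
                        "parameters": {"strategy": "mean"}}],
            "multi": [],
        },
    },
}

crim = resolve_single_pipeline(config, "crim", "NumericIntent")
indus = resolve_single_pipeline(config, "indus", "NumericIntent")
age = resolve_single_pipeline(config, "age", "NumericIntent")
```

Here :code:`crim` keeps its hard-coded pipeline, :code:`indus` has no override for its forced intent, and a column that is absent from the columns section picks up the global :code:`NumericIntent` override.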

The keys in this section each represent the name of an intent. In this example, :code:`NumericIntent` is being overridden. The value is a dictionary with the
keys :code:`"single"` and :code:`"multi"`, which represent the single and multi pipeline overrides. The value of these pipelines is parsed through the same mechanism as the pipelines
Expand All @@ -452,14 +455,15 @@ Postprocessor Override

.. code-block:: json
{"postprocess":[
["pca",["age"],[
["PCA", "PCA", {"n_components":2}]
]]
{"name":"pca","columns":["age"],"pipeline":[
{"transformer":"PCA", "name":"PCA", "parameters":{"n_components":2}}
]}
]}
Finally, in the :code:`postprocess` section of the configuration, you can manually define pipelines to execute on columns of your choosing. The
content of this section is a list of lists of the form :code:`[[name, [cols, ...], pipeline], ...]`. Each list defines a pipeline that will
content of this section is a list of dictionaries of the form :code:`[{"name":name, "columns":[cols, ...], "pipeline":pipeline}, ...]`. Each dictionary defines a pipeline that will
execute on certain columns. These processes execute after the intent pipelines!

**IMPORTANT** There are two ways of selecting columns through the cols list. By default, specifying a column, or a list of columns, will automatically select
Expand Down Expand Up @@ -512,27 +516,29 @@ This is what a combinations section looks like.

.. code-block:: json
{
"columns":{
"crim":["GenericIntent",
[
["StandardScaler", "Scaler", {"with_mean":false}]
]],
"indus":["GenericIntent"]
},
"postprocess":[],
{
"columns":{
"crim":{"intent": "GenericIntent",
"pipeline": [
{"transformer": "StandardScaler", "name": "Scaler", "parameters": {"with_mean":false}}
]},
"indus":{"intent": "GenericIntent"}
},
"intents":{},
"postprocess":[],
"intents":{},
"combinations": [
{
"columns.crim.pipeline.0.parameters.with_mean": "[True, False]",
"columns.crim.pipeline.0.name": "['Scaler', 'SuperScaler']"
}
]
"combinations": [
{
"columns.crim.1.0.2.with_mean": "[True, False]",
"columns.crim.1.0.1": "['Scaler', 'SuperScaler']"
}
]
}
This section of the configuration file is a list of dictionaries. Each dictionary represents a single parameter space definition that should be searched. Within these dictionaries
16 changes: 15 additions & 1 deletion foreshadow/intents/base.py
Expand Up @@ -127,7 +127,7 @@ def priority_traverse(cls):
node = lqueue.pop(0)
if len(node.children) > 0:
node_children = map(registry_eval, node.children)
lqueue.extend(node_children)
lqueue[0:0] = node_children  # Prepend children so the traversal is depth-first

@classmethod
@check_base
Expand All @@ -143,6 +143,20 @@ def is_intent(cls, df):
"""
pass # pragma: no cover

    @classmethod
    @check_base
    @abstractmethod
    def column_summary(cls, df):
        """Computes relevant statistics and returns a JSON dict of those values

        Args:
            df: pd.DataFrame to summarize

        Returns:
            A JSON dict of relevant statistics

        """
        pass  # pragma: no cover

@classmethod
def _check_intent(cls):
"""Validate class variables are setup properly"""
102 changes: 102 additions & 0 deletions foreshadow/intents/general.py
@@ -1,6 +1,8 @@
"""
General intents definitions
"""
import json
from collections import OrderedDict

import pandas as pd
import numpy as np
Expand All @@ -11,6 +13,39 @@
from ..transformers.smart import SimpleImputer, MultiImputer, Scaler, Encoder


def _mode_freq(s, count=10):
    """Computes the mode and the most frequent values

    Args:
        s (pandas.Series): the series to analyze
        count (int): the number of most frequent values to return

    Returns:
        A tuple with the list of modes and (the `count` most common values,
        their frequency counts, % frequencies)

    """
    mode = s.mode().values.tolist()
    vc = s.value_counts().nlargest(count).reset_index()
    vc["PCT"] = vc.iloc[:, -1] / s.size
    return (mode, vc.values.tolist())


def _outliers(s, count=10):
    """Computes the largest outliers, i.e. values more than 3 standard
    deviations away from the mean

    Args:
        s (pandas.Series): the series to analyze
        count (int): the n largest (magnitude) outliers

    Returns:
        A pandas.Series of outliers

    """
    out_ser = s[np.abs(s - s.mean()) > (3 * s.std())]
    out_df = out_ser.to_frame()
    out_df["selector"] = out_ser.abs()

    return out_df.loc[out_df["selector"].nlargest(count).index].iloc[:, 0]


class GenericIntent(BaseIntent):
"""See base class.
Expand All @@ -35,6 +70,11 @@ def is_intent(cls, df):
"""Returns true by default such that a column must match this"""
return True

    @classmethod
    def column_summary(cls, df):
        """No statistics can be computed for a general column"""
        return {}


class NumericIntent(GenericIntent):
"""See base class.
Expand Down Expand Up @@ -66,6 +106,52 @@ def is_intent(cls, df):
.all()
)

    @classmethod
    def column_summary(cls, df):
        """Returns computed statistics for a NumericIntent column

        The following are computed:
            nan: count of nans passed into the dataset
            invalid: number of invalid values after converting to numeric
            mean: -
            std: -
            min: -
            25th: 25th percentile
            median: -
            75th: 75th percentile
            max: -
            mode: mode or np.nan if data is mostly unique
            top10: top 10 most frequent values or empty array if mostly unique
                [(value, count),...,]
            10outliers: largest 10 outliers

        """

        data = df.iloc[:, 0]
        nan_num = int(data.isnull().sum())
        invalid_num = int(
            pd.to_numeric(df.iloc[:, 0], errors="coerce").isnull().sum() - nan_num
        )
        outliers = _outliers(data).values.tolist()
        mode, top10 = _mode_freq(data)

        return OrderedDict(
            [
                ("nan", nan_num),
                ("invalid", invalid_num),
                ("mean", data.mean()),
                ("std", data.std()),
                ("min", data.min()),
                ("25th", data.quantile(0.25)),
                ("median", data.quantile()),
                ("75th", data.quantile(0.75)),
                ("max", data.max()),
                ("mode", mode),
                ("top10", top10),
                ("10outliers", outliers),
            ]
        )


class CategoricalIntent(GenericIntent):
"""See base class.
Expand Down Expand Up @@ -94,3 +180,19 @@ def is_intent(cls, df):
return True
else:
return (1.0 * data.nunique() / data.count()) < 0.2

    @classmethod
    def column_summary(cls, df):
        """Returns computed statistics for a CategoricalIntent column

        The following are computed:
            nan: count of nans passed into the dataset
            mode: mode or np.nan if data is mostly unique
            top10: top 10 most frequent values or empty array if mostly unique
                [(value, count),...,]

        """
        data = df.iloc[:, 0]
        nan_num = int(data.isnull().sum())
        mode, top10 = _mode_freq(data)

        return OrderedDict([("nan", nan_num), ("mode", mode), ("top10", top10)])
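
For readers without pandas at hand, the two helper statistics used by the summaries above (most frequent values and three-sigma outliers) can be approximated with the standard library. This is a rough sketch, not the code from the commit: :code:`mode_freq` and :code:`three_sigma_outliers` are illustrative names, missing values are not handled, and the sample standard deviation is used as an assumption to mirror the pandas default.

```python
import statistics
from collections import Counter


def mode_freq(values, count=10):
    """Return (modes, [(value, frequency, fraction), ...]) for the top values."""
    counts = Counter(values)
    top = counts.most_common(count)
    highest = top[0][1] if top else 0
    modes = sorted(v for v, c in counts.items() if c == highest)
    return modes, [(v, c, c / len(values)) for v, c in top]


def three_sigma_outliers(values, count=10):
    """Return up to `count` values more than 3 standard deviations from the
    mean, ordered by decreasing magnitude."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (ddof=1)
    flagged = [v for v in values if abs(v - mean) > 3 * std]
    return sorted(flagged, key=abs, reverse=True)[:count]


modes, top = mode_freq(["a", "b", "a", "c"])
outliers = three_sigma_outliers([10] * 20 + [1000])
```

Like the helper in the commit, the outlier sketch flags values by their distance from the mean but ranks the flagged values by absolute magnitude.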
16 changes: 10 additions & 6 deletions foreshadow/optimizers/param_mapping.py
Expand Up @@ -148,24 +148,28 @@ def _set_path(key, value, original):
"""
path = key.split(".")
temp = original
curr_key = ""

try:

# Searches in path, indexed by key if dictionary or by index if list
for p in path[:-1]:
curr_key = p
if isinstance(temp, list):
temp = temp[int(p)]
else: # Dictionary
temp = temp[p]

# Set value
if isinstance(temp, list):
temp[int(path[-1])] = value
else: # Dictionary
temp[path[-1]] = value
# Always Dictionary
temp[path[-1]] = value

except KeyError as e:
raise ValueError("Invalid JSON Key")
raise ValueError("Invalid JSON Key {} in {}".format(curr_key, temp))

except ValueError as e:
raise ValueError(
"Attempted to index list {} with value {}".format(temp, curr_key)
)
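
The dotted keys used in the combinations section (for example :code:`columns.crim.pipeline.0.parameters.with_mean`) can be resolved against a nested configuration roughly as follows. This is a simplified sketch in the spirit of :code:`_set_path`, not the function from the commit: :code:`set_path` and its single error message are illustrative assumptions.

```python
def set_path(key, value, config):
    """Set config[...] in place, following a dotted path.

    Path parts index lists by integer position and dictionaries by key.
    Any lookup failure is reported as a ValueError.
    """
    path = key.split(".")
    node = config
    try:
        for part in path[:-1]:
            node = node[int(part)] if isinstance(node, list) else node[part]
        node[path[-1]] = value
    except (KeyError, IndexError, ValueError, TypeError) as exc:
        raise ValueError(
            "Cannot resolve {!r} in configuration".format(key)
        ) from exc


config = {
    "columns": {
        "crim": {"intent": "GenericIntent", "pipeline": [
            {"transformer": "StandardScaler", "name": "Scaler",
             "parameters": {"with_mean": False}},
        ]},
    },
}
set_path("columns.crim.pipeline.0.parameters.with_mean", True, config)
set_path("columns.crim.pipeline.0.name", "SuperScaler", config)
```

A path that names a missing key or a non-numeric list index raises :code:`ValueError` instead of leaking the underlying exception type.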


def _extract_config_params(param):
