This repository has been archived by the owner on Jan 9, 2024. It is now read-only.
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
* Cleaned version with NoTransform Neither processor
* Disable TFIDF and use NoTransform for NeitherType
* Integrating updated auto intent resolving
Showing 40 changed files with 2,843 additions and 33 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,59 @@ | ||
"""Neither Numerical Nor Categorical.""" | ||
|
||
from foreshadow.metrics import ( | ||
MetricWrapper, | ||
has_long_text, | ||
is_numeric, | ||
is_string, | ||
num_valid, | ||
unique_heur, | ||
) | ||
from foreshadow.utils import standard_col_summary | ||
|
||
from .base import BaseIntent | ||
|
||
|
||
class Neither(BaseIntent): | ||
    """Defines a Neither column type. | ||
    For now it mimics the Text intent. | ||
    """ | ||
|
||
    confidence_computation = { | ||
        MetricWrapper(num_valid): 0.2, | ||
        MetricWrapper(unique_heur): 0.2, | ||
        MetricWrapper(is_numeric, invert=True): 0.2, | ||
        MetricWrapper(is_string): 0.2, | ||
        MetricWrapper(has_long_text): 0.2, | ||
    } | ||
|
||
    def fit(self, X, y=None, **fit_params): | ||
        """Empty fit. | ||
        Args: | ||
            X: The input data | ||
            y: The response variable | ||
            **fit_params: Additional parameters for the fit | ||
        Returns: | ||
            self | ||
        """ | ||
        return self | ||
|
||
    def transform(self, X, y=None): | ||
        """Convert a column to a text form. | ||
        Args: | ||
            X: The input data | ||
            y: The response variable | ||
        Returns: | ||
            A column with all rows converted to text. | ||
        """ | ||
        return X.astype(str) | ||
|
||
    @classmethod | ||
    def column_summary(cls, df): # noqa | ||
        return standard_col_summary(df) |
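As a rough illustration of what the `Neither` intent above expresses, the sketch below combines five equally weighted metric scores into one confidence value and casts a column to text with `astype(str)`. The weighted-sum aggregation, the score names, and the helper functions `weighted_confidence` and `to_text` are assumptions made for this example; the diff does not show how `BaseIntent` actually consumes `confidence_computation`.

```python
import pandas as pd

# Hypothetical stand-ins for the five metrics in `confidence_computation`.
# Each score is assumed to lie in [0, 1]; the weights match the diff (0.2 each).
WEIGHTS = {
    "num_valid": 0.2,
    "unique_heur": 0.2,
    "not_numeric": 0.2,  # stands in for MetricWrapper(is_numeric, invert=True)
    "is_string": 0.2,
    "has_long_text": 0.2,
}


def weighted_confidence(scores, weights=WEIGHTS):
    """Return the weighted sum of metric scores (assumed aggregation rule)."""
    return sum(weights[name] * scores[name] for name in weights)


def to_text(X):
    """Mirror the body of Neither.transform: cast every value to str."""
    return X.astype(str)


# Made-up scores for a single column, for illustration only.
scores = {
    "num_valid": 1.0,
    "unique_heur": 0.5,
    "not_numeric": 1.0,
    "is_string": 1.0,
    "has_long_text": 0.0,
}
confidence = weighted_confidence(scores)

frame = pd.DataFrame({"col": [1, 2.5, "abc"]})
text_frame = to_text(frame)
```

With equal weights, no single metric dominates; a column needs to score well on most of the five checks to reach a high confidence under this assumed rule.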
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
"""Intent resolver definition.""" | ||
from foreshadow.smart.intent_resolving.intentresolver import IntentResolver | ||
|
||
|
||
__all__ = ["IntentResolver"] |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,47 @@ | ||
# automl_research | ||
Code repository for AutoML research to support Foreshadow project | ||
|
||
--- | ||
|
||
## Feature Type Inference (Intent Resolution) | ||
When analyzing raw data set feature columns in `Foreshadow`, the type (intent) of each feature column has to be known a priori to select the appropriate feature transformation downstream. | ||
|
||
The goal of this research project is to build an intent resolver that can separate numerical and categorical raw feature columns. More classes can be added in the future. | ||
|
||
### Installation | ||
This library was developed on Python 3.6.8 and uses the same package dependencies as `Foreshadow` as of Oct. 17, 2019. | ||
|
||
To install additional package dependencies for research-based functionalities, run the following: | ||
``` | ||
pip install -r research_requirements.txt | ||
``` | ||
|
||
### Usage | ||
The functionality of this library is exposed through the `IntentResolver` class API as shown below. The class outputs a prediction of "Numerical", "Categorical" or "Neither" for each raw feature column. Predictions with confidences lower than the `threshold` parameter (default = 0.7) in the `.predict` method are set to "Neither". | ||
|
||
``` | ||
import pandas as pd | ||
from lib import IntentResolver | ||
# Initialise object | ||
raw = pd.read_csv('path_to_dataset.csv', encoding='latin', low_memory=False) | ||
resolver = IntentResolver(raw) | ||
# Predict intent | ||
# Outputs a pd.Series of predicted intents | ||
resolver.predict() | ||
# OR: Predict intent with confidences at a lower threshold (i.e. less rigorous prediction) | ||
# Outputs a pd.DataFrame of predicted intent and confidences | ||
resolver.predict(threshold=0.6, return_conf=True) | ||
``` | ||
|
||
|
||
### Data Sources | ||
- [Original Meta Data Set (OMDS)](https://github.com/pvn25/ML-Data-Prep-Zoo/tree/master/ML%20Schema%20Inference/Data) | ||
- [360 Raw Data Sets (RDSs)](https://drive.google.com/file/d/1HGmDRBSZg-Olym2envycHPkb3uwVWHJX/view) (Sourced from the [GitHub README.md](https://github.com/pvn25/ML-Data-Prep-Zoo/tree/master/ML%20Schema%20Inference)) | ||
|
||
|
||
### References | ||
1. V. Shah, P. Kumar, K. Yang, and A. Kumar, “Towards semi-automatic ML feature type inference.” | ||
2. N. Hynes, D. Sculley, and M. Terry, “The data linter: Lightweight, automated sanity checking for ML data sets,” in NIPS MLSys Workshop, 2017. |
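The threshold behaviour described in the README's Usage section (predictions with confidence below `threshold` become "Neither") can be sketched as follows. This is a minimal illustration, not the `IntentResolver` implementation: the function `apply_threshold`, the shape of the probability table, and the column names are all hypothetical.

```python
import pandas as pd


def apply_threshold(probabilities, threshold=0.7, return_conf=False):
    """Pick the most likely class per feature column (hypothetical helper).

    `probabilities` is assumed to hold one row per raw feature column and one
    column per class. Rows whose best score is below `threshold` are labelled
    "Neither".
    """
    intent = probabilities.idxmax(axis=1)
    conf = probabilities.max(axis=1)
    # Keep the predicted class only where the confidence reaches the threshold.
    intent = intent.where(conf >= threshold, "Neither")
    if return_conf:
        return pd.DataFrame({"intent": intent, "confidence": conf})
    return intent


# Made-up class probabilities for three feature columns.
probs = pd.DataFrame(
    {"Numerical": [0.9, 0.35, 0.65], "Categorical": [0.1, 0.65, 0.35]},
    index=["age", "zip_code", "rating"],
)

default = apply_threshold(probs)
relaxed = apply_threshold(probs, threshold=0.6, return_conf=True)
```

Lowering the threshold trades precision for coverage: in this made-up data, `zip_code` and `rating` fall back to "Neither" at the default 0.7 but receive a class at 0.6.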
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
"""Module containing the core IntentResolver logic to be used in production.""" | ||
from . import heuristics, io | ||
from .data_set_parsers import DataFrameDataSetParser | ||
from .intent_resolver import IntentResolver | ||
from .secondary_featurizers import ( | ||
FeaturizerCurator, | ||
factory as featurizer_factory, | ||
) |
3 changes: 3 additions & 0 deletions
foreshadow/smart/intent_resolving/core/data_set_parsers/__init__.py
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
"""Module containing data set parser class definitions.""" | ||
from .base_data_set_parser import DataSetParser | ||
from .dataframe_data_set_parser import DataFrameDataSetParser |