Create MLPrimitive for feature engineering pipeline #85

Closed
micahjsmith opened this issue May 31, 2021 · 2 comments
Labels
enhancement New feature or request

Comments


micahjsmith commented May 31, 2021

Creating a primitive for Ballet feature engineering pipelines will allow these pipelines to be included in an MLBlocks MLPipeline.

  • the primitive is generic and works for any Ballet project
  • the primitive takes an optional init param that gives the fully-qualified name of the Ballet project
  • implement an adapter that:
    • takes the init param
    • initializes a ballet.client.Client
    • loads the desired project
    • accesses the project's feature engineering pipeline instance
    • creates a function that returns a deepcopy of the pipeline
    • returns that function
  • if the init param is not given, the project is detected from the cwd
  • expose the primitive through new ballet entry points
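The adapter steps listed above can be sketched roughly as follows. This is only an illustration of the closure-plus-deepcopy idea, not the Ballet implementation: `make_pipeline_factory`, the injected `load_project` callable (a stand-in for "initialize a client and load the project"), and the `pipeline` attribute are all hypothetical names chosen for this sketch.

```python
import copy
from types import SimpleNamespace
from typing import Any, Callable, Optional


def make_pipeline_factory(
    load_project: Callable[[Optional[str]], Any],
    project_name: Optional[str] = None,
) -> Callable[[], Any]:
    """Return a zero-argument function that yields a fresh pipeline copy.

    `load_project` is a hypothetical stand-in for the client/project lookup;
    a `project_name` of None means "detect the project from the cwd".
    """
    project = load_project(project_name)
    # Hypothetical attribute holding the feature engineering pipeline.
    pipeline = project.pipeline

    def make_feature_engineering_pipeline() -> Any:
        # Deep copy so that every caller gets independent, unshared state.
        return copy.deepcopy(pipeline)

    return make_feature_engineering_pipeline


def fake_loader(name: Optional[str]) -> SimpleNamespace:
    # Toy loader used only to exercise the sketch; a dict plays the pipeline.
    return SimpleNamespace(
        name=name or "project-from-cwd",
        pipeline={"steps": ["a", "b"]},
    )


factory = make_pipeline_factory(fake_loader, "predict_census_income")
first, second = factory(), factory()
```

Returning a factory rather than the pipeline object itself means two pipelines built from the same primitive never share fitted state.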

Prototype (not generic; hard-coded to the predict_census_income project):

{
    "name": "predict_census_income.engineer_features",
    "contributors": [
        "Micah Smith <micahs@mit.edu>"
    ],
    "documentation": "",
    "description": "Applies the feature engineering pipeline from the predict_census_income project",
    "classifiers": {
        "type": "preprocessor",
        "subtype": "transformer"
    },
    "modalities": [],
    "primitive": "predict_census_income.api.make_feature_engineering_pipeline",
    "fit": {
        "method": "fit",
        "args": [
            {
                "name": "X",
                "type": "pandas.DataFrame"
            },
            {
                "name": "y",
                "type": "pandas.DataFrame"
            }
        ]
    },
    "produce": {
        "method": "transform",
        "args": [
            {
                "name": "X",
                "type": "pandas.DataFrame"
            }
        ],
        "output": [
            {
                "name": "X",
                "type": "pandas.DataFrame"
            }
        ]
    },
    "hyperparameters": {}
}
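To see which parts of the prototype are project-specific, here is a small sketch that builds the same annotation from a project name. It assumes that the `<project>.engineer_features` and `<project>.api.make_feature_engineering_pipeline` naming seen in the prototype carries over to other projects, which this issue does not confirm; `build_annotation` is a hypothetical helper, and the contributors list is left empty on purpose.

```python
import json


def build_annotation(project: str) -> dict:
    """Build a primitive annotation shaped like the prototype (hypothetical helper)."""
    df = "pandas.DataFrame"
    return {
        "name": f"{project}.engineer_features",
        "contributors": [],
        "documentation": "",
        "description": (
            f"Applies the feature engineering pipeline from the {project} project"
        ),
        "classifiers": {"type": "preprocessor", "subtype": "transformer"},
        "modalities": [],
        # Assumed naming convention, generalized from the prototype.
        "primitive": f"{project}.api.make_feature_engineering_pipeline",
        "fit": {
            "method": "fit",
            "args": [{"name": "X", "type": df}, {"name": "y", "type": df}],
        },
        "produce": {
            "method": "transform",
            "args": [{"name": "X", "type": df}],
            "output": [{"name": "X", "type": df}],
        },
        "hyperparameters": {},
    }


annotation = build_annotation("predict_census_income")
print(json.dumps(annotation, indent=4))
```

Only three fields depend on the project name, which is what makes a single generic primitive with one init param plausible.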
micahjsmith added the enhancement label May 31, 2021
@micahjsmith

Something like this gives the feature engineering pipeline access to the unencoded targets for supervised transformations, while the classifier still trains on the encoded targets:

import mlblocks
from ballet import b
from sklearn.metrics import classification_report

# Load the training split and the validation split
X_df, y_df = b.api.load_data()
X_df_te, y_df_te = b.api.load_data(input_dir='data/val')

# Encode the targets for the classifier; keep y_df as the raw targets
encoder = b.api.encoder
y = encoder.fit_transform(y_df)
y_te = encoder.transform(y_df_te)

pipeline = mlblocks.MLPipeline(
    primitives=[
        'predict_census_income.engineer_features',
        'sklearn.ensemble.RandomForestClassifier',
    ],
    # Feed the feature engineering step the raw targets (y_df) as its 'y'
    input_names={
        'predict_census_income.engineer_features#1': {
            'y': 'y_df',
        }
    },
)

# y_df is passed as an extra keyword so it is available to the first step
pipeline.fit(X_df, y, y_df=y_df)
y_pred = pipeline.predict(X_df)
report = classification_report(y, y_pred, output_dict=True)

# Evaluate on the validation split
y_pred_te = pipeline.predict(X_df_te)
report_te = classification_report(y_te, y_pred_te, output_dict=True)
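The `input_names` remapping used above can be pictured with a toy resolver. This is a simplified model of the idea, not MLBlocks' actual code: `resolve_args` and the string placeholders in `context` are made up for illustration.

```python
from typing import Any, Dict, List


def resolve_args(
    block_name: str,
    arg_names: List[str],
    context: Dict[str, Any],
    input_names: Dict[str, Dict[str, str]],
) -> Dict[str, Any]:
    """Look up each argument in the context, honoring per-block renames."""
    renames = input_names.get(block_name, {})
    return {arg: context[renames.get(arg, arg)] for arg in arg_names}


# Placeholder values standing in for the real data objects
context = {"X": "features", "y": "encoded targets", "y_df": "raw targets"}
input_names = {"predict_census_income.engineer_features#1": {"y": "y_df"}}

fe_args = resolve_args(
    "predict_census_income.engineer_features#1", ["X", "y"], context, input_names
)
clf_args = resolve_args(
    "sklearn.ensemble.RandomForestClassifier#1", ["X", "y"], context, input_names
)
```

In this model the feature engineering step receives the raw targets as its `y`, while the classifier, which has no rename entry, receives the encoded ones.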

@micahjsmith

Added in #86
