On demand feature views (ODFVs) should use support python dicts #2261

Closed
adchia opened this issue Jan 31, 2022 · 6 comments

@adchia
Collaborator

adchia commented Jan 31, 2022

In some test benchmarks, using regular Python dicts as inputs to the transformations is much faster (up to ~10x) than pandas for the online flow. This tends to be the more latency-sensitive flow (offline flows seem to be ~40% slower if using vectorized operations).

Something that looks like:

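from typing import Any, Dict

from feast import Field
from feast.on_demand_feature_view import on_demand_feature_view
from feast.types import Float64

# driver_hourly_stats_view and val_to_add_request are assumed to be defined elsewhere.
@on_demand_feature_view(
    sources=[driver_hourly_stats_view, val_to_add_request],
    schema=[
        Field(name="conv_rate_plus_val1", dtype=Float64),
        Field(name="conv_rate_plus_val2", dtype=Float64),
    ],
    mode="python",  # proposed: pass plain dicts instead of DataFrames
)
def transformed_conv_rate(driver_hourly_stats: Dict[str, Any], vals_to_add: Dict[str, Any]) -> Dict[str, Any]:
    features = {}
    features['conv_rate_plus_val1'] = driver_hourly_stats['conv_rate'] + vals_to_add['val_to_add']
    features['conv_rate_plus_val2'] = driver_hourly_stats['conv_rate'] + vals_to_add['val_to_add_2']
    return features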

might be similar to what we want

@judahrand
Member

In your example, what type would driver_hourly_stats['conv_rate'] be? Would it be a list? If so, this interface will require some additional transformation. Would it be better to pass NumPy arrays? If so, have you factored the instantiation of these into the ~10x speedup?
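To make the question concrete, here is a rough sketch of what the function body might look like if each value were a list (one entry per row) rather than a scalar. The function name and sample shapes are made up for illustration, and this is plain Python with no Feast decorator, so it says nothing about the actual Feast API.

```python
from typing import Dict, List


def transformed_conv_rate_lists(
    driver_hourly_stats: Dict[str, List[float]],
    vals_to_add: Dict[str, List[float]],
) -> Dict[str, List[float]]:
    # Hypothetical list-valued variant: `+` on two lists would concatenate
    # them, so the addition has to be spelled out element by element.
    conv_rate = driver_hourly_stats["conv_rate"]
    return {
        "conv_rate_plus_val1": [
            c + v for c, v in zip(conv_rate, vals_to_add["val_to_add"])
        ],
        "conv_rate_plus_val2": [
            c + v for c, v in zip(conv_rate, vals_to_add["val_to_add_2"])
        ],
    }
```

This is the "additional transformation" I mean: the user code can no longer be written as a simple scalar expression.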

@adchia
Collaborator Author

adchia commented Feb 14, 2022

NumPy arrays should also be a lot better, yeah.

It might make sense to support both, but really there's also the factor of what the user will have access to at serving time. That seems more likely to be a standard dict. NumPy should def be faster, but I also worry since it's significantly more verbose.

Could also see a world where we allow either pandas or dicts, since transformations are easier to write in pandas but less performant.

In this specific situation, those conv_rate values will be individual doubles.
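As a minimal sketch of the scalar case, assuming a single online request where every dict value is one float: the function name and the sample values below are made up, and the Feast decorator is left out so the snippet runs on its own.

```python
from typing import Any, Dict


def transformed_conv_rate_scalars(
    driver_hourly_stats: Dict[str, Any],
    vals_to_add: Dict[str, Any],
) -> Dict[str, Any]:
    # Each value is a single double, so plain `+` is all that is needed.
    conv_rate = driver_hourly_stats["conv_rate"]
    return {
        "conv_rate_plus_val1": conv_rate + vals_to_add["val_to_add"],
        "conv_rate_plus_val2": conv_rate + vals_to_add["val_to_add_2"],
    }


# One row, as it might arrive at serving time (illustrative values only).
features = transformed_conv_rate_scalars(
    {"conv_rate": 0.5},
    {"val_to_add": 1.0, "val_to_add_2": 2.0},
)
print(features)  # {'conv_rate_plus_val1': 1.5, 'conv_rate_plus_val2': 2.5}
```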

@judahrand
Member

I was actually thinking a dict of 1-D NumPy arrays?
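Roughly something like the sketch below, assuming each key maps to a 1-D array with one entry per row. The function name and the sample values are hypothetical, and this is plain NumPy without the Feast decorator.

```python
from typing import Dict

import numpy as np


def transformed_conv_rate_arrays(
    driver_hourly_stats: Dict[str, np.ndarray],
    vals_to_add: Dict[str, np.ndarray],
) -> Dict[str, np.ndarray]:
    # NumPy's `+` is element-wise, so the body reads like scalar code
    # while still handling many rows at once.
    conv_rate = driver_hourly_stats["conv_rate"]
    return {
        "conv_rate_plus_val1": conv_rate + vals_to_add["val_to_add"],
        "conv_rate_plus_val2": conv_rate + vals_to_add["val_to_add_2"],
    }


# Two rows of illustrative data.
batch = transformed_conv_rate_arrays(
    {"conv_rate": np.array([0.5, 0.25])},
    {
        "val_to_add": np.array([1.0, 2.0]),
        "val_to_add_2": np.array([10.0, 20.0]),
    },
)
```

The cost of building the arrays from the raw request values is not shown here and would need to be included in any benchmark.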

@adchia adchia changed the title On demand feature views (ODFVs) should use python dicts instead of dataframes On demand feature views (ODFVs) should use support python dicts May 17, 2022
@stale

stale bot commented Sep 20, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix This will not be worked on label Sep 20, 2022
@stale stale bot closed this as completed Sep 28, 2022
@judahrand judahrand reopened this Sep 28, 2022
@stale stale bot closed this as completed Oct 14, 2022
@judahrand judahrand reopened this Oct 14, 2022
@stale stale bot removed the wontfix This will not be worked on label Oct 14, 2022
@adchia adchia unassigned woop Oct 14, 2022
@franciscojavierarceo
Member

tagging @maksstach who has implemented a version of this on our fork

@franciscojavierarceo
Member

I did this one in #4045; next I'll do writes.
