How to use featuretools at the test time? It seems featuretools' feature definitions do not store train time statistics to accurately apply primitives like 'PERCENTILE' at the test time #2697

Open
nitinmnsn opened this issue Mar 23, 2024 · 0 comments

Comments

@nitinmnsn

Creating a GitHub issue for better visibility. I have also asked the same question on StackOverflow.

I will demonstrate the issue with an example that uses the 'PERCENTILE' primitive.

Imports:

import pandas as pd
import featuretools as ft

For training (create a simple dataset with one value column and let featuretools compute a percentile feature on it):

df_train = pd.DataFrame({'index':[1,2,3,4,5], 'val':[1,2,3,4,5]})
es_train = ft.EntitySet("es_train")
es_train.add_dataframe(df_train,'df')
fm, fl = ft.dfs(entityset = es_train, trans_primitives=['percentile'], agg_primitives=[], target_dataframe_name='df')

output:

print(fm)
       val  PERCENTILE(val)
index                      
1        1              0.2
2        2              0.4
3        3              0.6
4        4              0.8
5        5              1.0

So far, everything is as expected.

Now suppose that at test time I get an example with the value 3. I would want it mapped to 0.6, as in the training data. But that is not what happens:

df_test = pd.DataFrame({'index':[1], 'val':[3]})
es_test = ft.EntitySet("es_test")
es_test.add_dataframe(df_test,'df')
ft.calculate_feature_matrix(features = fl, entityset=es_test)

output:

       val  PERCENTILE(val)
index                      
1        3              1.0
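
Both outputs are consistent with a rank-based percentile computed only over the rows present in the EntitySet. Here is a minimal pandas-only sketch of that behavior. It assumes the primitive acts like `Series.rank(pct=True)`, which is inferred from the outputs above rather than confirmed from the featuretools source, and `percentile_within` is a hypothetical helper name:

```python
import pandas as pd


def percentile_within(values):
    """Rank each value as a fraction of the rows it is passed.

    Assumption: this mimics what PERCENTILE appears to do; it is not
    featuretools code.
    """
    return pd.Series(values).rank(pct=True)


# Five training rows: ranks 1..5 divided by 5.
train_pct = percentile_within([1, 2, 3, 4, 5])
print(train_pct.tolist())  # [0.2, 0.4, 0.6, 0.8, 1.0]

# A single test row: rank 1 divided by 1, regardless of the value.
test_pct = percentile_within([3])
print(test_pct.tolist())  # [1.0]
```

With a single test row, the only possible result is 1.0, because the rank has nothing but that row to compare against.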

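So the feature definitions in fl, as returned by ft.dfs, do not store the train-time statistics needed to compute the features at test time. PERCENTILE is recomputed from whatever rows are in the test EntitySet, so the same raw value maps to a different feature value than it did during training. This would throw any machine-learning model into a tailspin.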
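
For comparison, the behavior I am after can be sketched outside featuretools by keeping the sorted training values and looking test values up against them. This is only an illustration of the desired semantics, not a featuretools API: `TrainPercentile` is a hypothetical class, and it uses a "fraction of training values less than or equal to x" rule, which matches the output above for distinct values but may differ from a rank-based percentile when there are ties.

```python
import numpy as np


class TrainPercentile:
    """Hypothetical helper (not part of featuretools).

    Stores the training values at fit time and maps any later value to
    the fraction of training values that are <= it.
    """

    def fit(self, values):
        # Keep a sorted copy of the training values as the stored statistic.
        self.sorted_ = np.sort(np.asarray(values, dtype=float))
        return self

    def transform(self, values):
        values = np.asarray(values, dtype=float)
        # side="right" counts training values <= each input value.
        counts = np.searchsorted(self.sorted_, values, side="right")
        return counts / len(self.sorted_)


mapper = TrainPercentile().fit([1, 2, 3, 4, 5])

print(mapper.transform([1, 2, 3, 4, 5]).tolist())  # [0.2, 0.4, 0.6, 0.8, 1.0]
print(mapper.transform([3]).tolist())              # [0.6]
```

Values outside the training range clamp to 0.0 or 1.0 under this rule. The result of `transform` could then be assigned over the PERCENTILE(val) column of the test feature matrix, though doing that by hand for every such primitive is exactly what I would like to avoid.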

What is the canonical way to apply featuretools at test time?
