
[Feature Creation] Decision tree creates a new feature by combining numerous variables #454

Closed

Conversation

Morgan-Sell
Collaborator

Closes #107

Notes from #107:

New variables are created by combining user-indicated variables with decision trees. For example: if the user passes 3 variables to the transformer, a new feature will be created by fitting a decision tree with these three variables and the target.
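To make the idea concrete, here is a minimal sketch of the mechanism described above (the column names, the target, and the output column name are hypothetical, and a shallow tree depth is assumed just for illustration):

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Toy data: three user-indicated variables plus a target.
df = pd.DataFrame({
    "var1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "var2": [5.0, 4.0, 3.0, 2.0, 1.0],
    "var3": [1.0, 1.0, 2.0, 2.0, 3.0],
    "target": [1.2, 1.9, 3.1, 3.9, 5.2],
})

# Fit a decision tree on the three variables and the target...
tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(df[["var1", "var2", "var3"]], df["target"])

# ...and use the tree's predictions as the new feature.
df["tree_var1_var2_var3"] = tree.predict(df[["var1", "var2", "var3"]])
```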

To think about:
Should we make the transformer so that it combines variables in groups of 2s, 3s, etc.? Say the user passes 5 variables: should we create features combining all possible groups of 2, all possible groups of 3, all possible groups of 4, and all 5?

Need to think a bit. I know that we do combine a few variables with trees to create new ones, particularly for use in linear models. But this brute-force combining of everything with everything, for the sake of combining, I have not seen in organisations where models will be used to score customers. So maybe not ideal. It also increases computational cost, which is not in the spirit of feature-engine.

@Morgan-Sell
Collaborator Author

Hello @solegalli,

A few questions:

  • Did we make a final decision on whether the class should create new features from all the possible permutations of the user-selected variables? I guess we could create an all_permutations init param. Although, I am not convinced of the param's value. I can't think of a use case; however, my experience is limited.
  • Should the class allow the user to choose from all of sklearn's decision-tree init params - e.g., max_depth and min_samples_leaf - to prevent overfitting? Or do we want to limit the user to 1 or 2 params?
  • I plan to limit the variables to numerical, unless the categorical variables are encoded. Do you agree?
  • Should the class apply to both regression and classification?

@solegalli
Collaborator

An idea would be:

These parameters in the init:
variables = variable list (as always)
output_features = None, integer, list of integers, tuple

So if I pass three variables in the list: [var1, var2, var3] and:

  • 1 in the output_features: return new features based on the predictions of a tree trained on each variable individually; 3 new features
  • 2: make all possible combinations of 2 variables: (var1, var2), (var1, var3), (var2, var3); 3 new features in this example
  • 3: make all possible combinations of 3: in this case only 1 possible combination (var1, var2, var3); 1 new feature in this example
  • 4 or greater: raise an error; more combinations requested than variables in the list

If I pass a list, say [1, 2], then we return the output of 1 and 2 as above.
If None, then return all possible 1s, 2s, and 3s in this case; if the list contained more variables, it would also include the 4s and 5s.

Alternatively, the user can pass a tuple of tuples, e.g. (var1, (var1, var2), (var1, var2, var3)), indicating how to combine the variables.
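The scheme above could be resolved into variable groupings along these lines (a sketch only; the function name and semantics are assumptions based on the description, not the final API):

```python
from itertools import combinations

def resolve_groupings(variables, output_features=None):
    """Return the tuples of variables to combine with a decision tree.

    output_features: None (all sizes), an integer (one size),
    or a list of integers (several sizes), as described above.
    """
    n = len(variables)
    if output_features is None:
        sizes = range(1, n + 1)          # all 1s, 2s, ..., n
    elif isinstance(output_features, int):
        sizes = [output_features]
    else:                                 # list of integers
        sizes = output_features

    groupings = []
    for size in sizes:
        if size > n:
            raise ValueError("more combinations requested than variables in list")
        groupings.extend(combinations(variables, size))
    return groupings

resolve_groupings(["var1", "var2", "var3"], 2)
# [('var1', 'var2'), ('var1', 'var3'), ('var2', 'var3')]
```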

@Morgan-Sell
Collaborator Author

Hello @solegalli,

I hope you're enjoying the vacay!

When/why would a person apply the decision-tree transformer to one variable?

@Morgan-Sell
Collaborator Author

Hello @solegalli,

The transformer is generating new variables. I have created a few unit tests. I've written some of the docstrings. Before I progress, would you please review/counsel me? We both know I need it ;)

A few questions:

  • Which decision tree params should we include to mitigate the risk of overfitting? Currently, the class only accepts max_depth.
  • I was surprised that the BaseCreation class does not create self.variables_. Typically, we use the function _find_or_check_numerical_variables() to create/return self.variables_. It seems redundant to call this function again given that it is already called in the BaseCreation class.
  • Is the error "ValueError: variables must a list of strings or integers comprise of distinct variables. Got None instead" caused by not having self.variables_? Shouldn't this attribute be created/inherited from the BaseCreation class?
  • Do you see a more efficient approach to saving the fitted estimators and generating the new features?
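On the last question, one approach I had in mind is to keep a dict of fitted estimators keyed by variable grouping, and build one new column per grouping at transform time. A sketch, assuming a regression target (the attribute name estimators_ and the column-naming convention are hypothetical):

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

X = pd.DataFrame({"var1": [1, 2, 3, 4], "var2": [4, 3, 2, 1]})
y = pd.Series([1.0, 2.0, 3.0, 4.0])

# fit: one estimator per variable grouping, keyed by the grouping tuple
estimators_ = {}
for group in [("var1",), ("var2",), ("var1", "var2")]:
    est = DecisionTreeRegressor(max_depth=2, random_state=0)
    estimators_[group] = est.fit(X[list(group)], y)

# transform: one new column per fitted grouping
new_cols = {
    "tree_" + "_".join(group): est.predict(X[list(group)])
    for group, est in estimators_.items()
}
```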

Lastly, I included a couple of TODO comments.

Thanks!

@solegalli
Collaborator

Hi @Morgan-Sell

I've seen you made a lot of commits. Is this work in progress? Do you still need to update the tests? They are all failing :_(

I am on holidays from Thursday till August. So if you don't hear from me... you know why ;)

Cheers

@Morgan-Sell
Collaborator Author

hi @solegalli.

Ahh.... a month-long vacation! Hopefully, the US will adopt such traditions one day ;)

I'm still working on this class. I do have one question.

The following test is failing:

FAILED tests/test_creation/test_check_estimator_creation.py::test_check_estimator_from_sklearn[estimator6] - ValueError: No numerical variables found in this dataframe. Please check variable format with pandas dtypes.

Do you know if sklearn's check_estimator tests a dataframe without numerical variables? The check_estimator docs have limited information.

If so, then DecisionTreeFeatures should raise an error. Other feature-engine classes skip certain sklearn checks, which makes sense given that not all of sklearn's checks are appropriate for every class. How do we choose to omit certain tests?
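If it helps: one mechanism sklearn offers for this (in sklearn 0.24+, if I recall correctly - treat the details as an assumption) is the "_xfail_checks" estimator tag, which marks named checks as expected failures for check_estimator. A minimal sketch with a stub transformer (the check name listed is only an example):

```python
from sklearn.base import BaseEstimator, TransformerMixin

class DecisionTreeFeatures(BaseEstimator, TransformerMixin):
    """Stub transformer illustrating the _xfail_checks tag."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X

    def _more_tags(self):
        # check_estimator treats the listed checks as expected
        # failures instead of hard errors.
        return {
            "_xfail_checks": {
                "check_estimators_dtypes": (
                    "transformer requires numerical variables in a dataframe"
                ),
            }
        }
```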

@Morgan-Sell
Collaborator Author

Hi @solegalli,

I'm embarrassed to say this, but I'm stumped by these errors. Hopefully, we can discuss the errors when you return.

Successfully merging this pull request may close these issues.

feature creation: create new features by combining variables with decision trees