Rulefit with categories + multi-column problems #117

Closed
avraam-inside opened this issue Jun 12, 2022 · 2 comments
Labels
duplicate This issue or pull request already exists enhancement New feature or request

Comments

@avraam-inside

Hi!

Introductory

While exploring what RuleFit can do (via imodels), some questions came up. I would be happy to discuss them or get some hints.

First I will describe the DataFrame and the code, then show the results, followed by my questions about them.


Input data

A data.csv file has been generated in which the bulk of the cases (oil supply processes) last 4-6 hours (case_duration ~20,000), and there are abnormal cases that last about 1.5 days (case_duration > 100,000).

The goal is to find out what causes this anomaly.

Here are the conditions that led to the high duration:

image
(this can be seen even by inspecting the table by eye)

So the input is an event log (only the columns that had at least a minimal impact are included; this was calculated separately beforehand), there is a breakdown condition, and it is clear what the output should look like.


CODE: Sending an eventlog to rulefit

Here is a Jupyter notebook with the code (change the file extension to .ipynb), and here is data.csv again.

Here is the result:
image

As you can see, it falls somewhat short of expectations.

Questions:

  1. The first thing that catches the eye is the 0.5 thresholds with inequality signs pointing in different directions. Why do they appear in the output? After one-hot encoding, the algorithm only sees 0 and 1, a pure category. It can do without them, by the way. Example of a rule from a simple event log:
    image
    (everything is right here, there is nothing to find fault with).

    Is there any way to tell the algorithm not to generate these numbers for category columns?

  2. Although the true rule has between 5 and 9 conditions, the algorithm returns only 3-4 conditions, with an incorrect answer and a large coefficient.

  3. Based on the points above, are there any ideas on how to configure the algorithm so that it returns the correct set of rules?
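
On question 1, one plausible explanation, offered as an assumption about the underlying tree learner rather than something confirmed in this thread: tree-based rule generators commonly place a split threshold halfway between two adjacent observed values of a feature. A 0/1 column then has exactly one candidate threshold, 0.5, so `col <= 0.5` means "col is 0" and `col > 0.5` means "col is 1". The helper `candidate_thresholds` below is hypothetical and only illustrates the midpoint idea:

```python
def candidate_thresholds(values):
    """Return the midpoints between consecutive distinct values of a feature."""
    distinct = sorted(set(values))
    return [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]


# A one-hot encoded column only ever contains 0 and 1,
# so the only possible midpoint split is 0.5.
one_hot_column = [0, 1, 1, 0, 1]
thresholds = candidate_thresholds(one_hot_column)
print(thresholds)
```

Under that assumption the 0.5 values are not noise: they are the only way a numeric split can separate the two categories.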


P.S.:

  1. About the documentation: many of the examples use arrays in various formats. Why? In my own practice, all cases are based on tables (pandas DataFrames), because they are more convenient.
  2. The ConvergenceWarning message "Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 0.000e+00, tolerance: 0.000e+00" raises two questions:
    First, why is this message printed n times, clogging up the console, instead of once?
    Second, which specific parameters of RuleFitRegressor should be increased?
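
As a stopgap for the console noise, the warning can be filtered by its text with Python's standard `warnings` module. This is a minimal sketch, assuming the warning text starts with the phrase quoted above and is raised in the current process; `run_quietly` is a hypothetical helper, and it hides the message rather than fixing the convergence issue. If the linear stage is fitted many times (for example during cross-validation), each fit may emit its own warning, which would explain the repetition, but that is a guess.

```python
import warnings

# Assumption: the warning text starts with this phrase, as quoted in the issue.
# `message` is a regular expression matched against the start of the warning text.
CONVERGENCE_PREFIX = r"Objective did not converge"


def run_quietly(fit_fn):
    """Call fit_fn() while ignoring warnings that start with CONVERGENCE_PREFIX."""
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", message=CONVERGENCE_PREFIX)
        return fit_fn()
```

Usage would look like `run_quietly(lambda: model.fit(X, y))`, where `model`, `X`, and `y` stand for whatever estimator and data are in use. Other warnings still pass through, and the filter is removed again when the call returns.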
@avraam-inside
Author

Hi!

We found out that RuleFitRegressor returns a minimal working set of rules, one that is enough to filter the DataFrame and get the same set of abnormal cases. I was wrong to expect a complete set of columns with their values.

The question of the 0.5 thresholds for categorical columns remains, but I can live with it in the current implementation: I will write a regular expression that removes these <= 0.5 and > 0.5 comparisons for categorical columns.

At the moment the question is no longer very pressing. I may close it later if nothing else turns up.
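
A minimal sketch of such a regular expression, assuming rules are plain strings of the form `column <= 0.5 and other > 0.5` and that column names contain no whitespace (both are assumptions, and `simplify_rule` is a hypothetical helper, not part of any library):

```python
import re

# Matches "<column> <= 0.5" or "<column> > 0.5".
# The lookahead keeps values such as 0.55 from matching.
_HALF_SPLIT = re.compile(r"(?P<col>\S+)\s*(?P<op><=|>)\s*0\.5(?![\d.])")


def simplify_rule(rule, categorical_columns):
    """Rewrite 0.5 thresholds on one-hot columns as explicit == 0 / == 1 tests."""

    def replace(match):
        col = match.group("col")
        if col not in categorical_columns:
            return match.group(0)  # leave numeric columns untouched
        value = 0 if match.group("op") == "<=" else 1
        return f"{col} == {value}"

    return _HALF_SPLIT.sub(replace, rule)


rule = "stage_A <= 0.5 and pump_B > 0.5 and pressure > 0.5"
print(simplify_rule(rule, {"stage_A", "pump_B"}))
```

Passing the set of one-hot column names explicitly keeps genuine numeric thresholds that happen to equal 0.5 (here the made-up `pressure` column) unchanged.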

@csinva csinva added the enhancement New feature or request label Jul 3, 2022
@csinva csinva added the duplicate This issue or pull request already exists label Jul 29, 2022
@csinva
Owner

csinva commented Jul 29, 2022

Duplicate of #77 (supporting categorical features).

@csinva csinva closed this as completed Jul 29, 2022