
docs: correct readme example #50

Merged · 4 commits · Dec 8, 2022 · Changes from 2 commits
22 changes: 14 additions & 8 deletions README.md
@@ -7,13 +7,17 @@ LOFO first evaluates the performance of the model with all the input features in
If a model is not passed as an argument to LOFO Importance, it will run LightGBM as a default model.

## Install

LOFO Importance can be installed using

```
pip install lofo-importance
```

## Advantages of LOFO Importance

LOFO has several advantages compared to other importance types:

* It does not favor granular features
* It generalises well to unseen test sets
* It is model agnostic
@@ -22,9 +26,10 @@ LOFO has several advantages compared to other importance types:
* It can automatically group highly correlated features to avoid underestimating their importance.

## Example on Kaggle's Microsoft Malware Prediction Competition

In this Kaggle competition, Microsoft provides a malware dataset to predict whether or not a machine will soon be hit with malware. One of the features, Census_OSVersion, is very predictive on the training set, since some OS versions are probably more prone to bugs and failures than others. However, upon splitting the data out of time, we obtain validation sets with OS versions that did not occur in the training set, so the model cannot have learned the relationship between the target and this seasonal feature. Other importance types evaluate this feature on the training set alone and therefore assign it high importance. LOFO Importance, by contrast, depends on a validation scheme, so it gives this feature not just low but negative importance.

```python
import pandas as pd
from sklearn.model_selection import KFold
from lofo import LOFOImportance, Dataset, plot_importance
@@ -38,10 +43,10 @@ sample_df = train_df.sample(frac=0.01, random_state=0)
sample_df.sort_values("AvSigVersion", inplace=True)

# define the validation scheme
- cv = KFold(n_splits=4, shuffle=False, random_state=0)
+ cv = KFold(n_splits=4, shuffle=True, random_state=None)
**Owner:** Why did you change kfold params?

**Contributor Author:** [screenshot of notebook error] We can't set shuffle to False with random_state set to 0, as in this capture of my notebook.

**Owner:** Maybe it changed with a new sklearn version. Anyway, you can set the random_state to None, but shuffle should stay as False since it is doing a lazy time split.

**Contributor Author:** Yep, it seems like the API has changed.

**Collaborator:** Maybe good to add a comment and make the example a bit more general:

`sample_df.sort_values("AvSigVersion", inplace=True)`
→
`sample_df = sample_df.sample(frac=1)  # Shuffling rows before CV`

and

`cv = KFold(n_splits=4, shuffle=False)  # No shuffling to keep the same folds for each feature`

**Owner (@aerdem4, Dec 6, 2022):** I thought anyone could search for the competition name and get the data, but we can also share the link to the competition on the readme? We can also comment that AvSigVersion is a proxy for time because the data had no time column.

> And having a time split validation is maybe not the best readme example you want to show.

Why?

**Collaborator:** Not the most common use case. Ok, now that I read the readme more, you are right: that section is called "Example on Kaggle's Microsoft Malware Prediction Competition". So:

`sample_df.sort_values("AvSigVersion", inplace=True)  # Sort by time for time split validation`

`cv = KFold(n_splits=4, shuffle=False)  # Don't shuffle to keep the time split validation`

Maybe good to add a generic example based on sklearn data like Iris.

**Owner:** These comments should make it easier to understand 👍

LOFO is usually more useful for non-random split problems, but such an example could also be nice.

**Owner:** @KameniAlexNea can you please add these 2 comments above that @stephanecollot shared, since you already updated the readme? Then I can merge it.

**Contributor Author:** I just made a push now.


# define the binary target and the features
- dataset = Dataset(df=sample_df, target="HasDetections", features=[col for col in train_df.columns if col != target])
+ dataset = Dataset(df=sample_df, target="HasDetections", features=[col for col in train_df.columns if col != "HasDetections"])

# define the validation scheme and scorer. The default model is LightGBM
lofo_imp = LOFOImportance(dataset, cv=cv, scoring="roc_auc")
@@ -52,15 +57,16 @@ importance_df = lofo_imp.get_importance()
# plot the means and standard deviations of the importances
plot_importance(importance_df, figsize=(12, 20))
```

![alt text](docs/plot_importance.png?raw=true "Title")

## Another Example: Kaggle's TReNDS Competition

In this Kaggle competition, participants are asked to predict some cognitive properties of patients.
Independent component (IC) features from sMRI and very high-dimensional correlation features (FNC) from 3D fMRIs are provided.
LOFO can group the fMRI correlation features into one.

```python
def get_lofo_importance(target):
cv = KFold(n_splits=7, shuffle=True, random_state=17)

Expand All @@ -75,14 +81,14 @@ def get_lofo_importance(target):

plot_importance(get_lofo_importance(target="domain1_var1"), figsize=(8, 8), kind="box")
```

![alt text](docs/plot_importance_box.png?raw=true "Title")

## Flofo Importance

If running the LOFO Importance package is too time-costly for you, you can use Fast LOFO (FLOFO). FLOFO takes as inputs an already trained model and a validation set, and applies a pseudo-random permutation to the values of each feature, one feature at a time, then uses the trained model to make predictions on the validation set. The FLOFO importance of a feature is the mean drop in model performance on the validation set over several randomised permutations.
The difference between FLOFO importance and permutation importance is that the permutations of a feature's values are done within groups, where groups are obtained by grouping the validation set by k=2 features. These k features are chosen at random n=10 times, and the mean and standard deviation of the FLOFO importance are calculated over these n runs.
The reason this grouping gives a better measure of importance is that permuting a feature's values is no longer completely random: the permutations are done within groups of similar samples, so they are equivalent to noising the samples. This ensures that:

* The permuted feature values are very unlikely to be replaced by unrealistic values.
* A feature that is predictable by features among the chosen n*k features will be replaced by very similar values during permutation. Therefore, it will only slightly affect the model performance (and will yield a small FLOFO importance). This solves the correlated feature overestimation problem.
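The grouped permutation at the heart of FLOFO can be sketched in plain Python. This is an illustrative reimplementation of the idea, not the package's code: the function name and the dict-of-rows layout are made up for the example, and in practice the grouping columns would be k=2 features chosen at random on each of the n runs.

```python
import random
from collections import defaultdict

def permute_within_groups(rows, permute_col, group_cols, rng):
    """Shuffle `permute_col` only among rows that share the same values
    for `group_cols`, so each row receives a realistic replacement value."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[tuple(row[c] for c in group_cols)].append(i)
    out = [dict(row) for row in rows]  # leave the input untouched
    for indices in groups.values():
        values = [rows[i][permute_col] for i in indices]
        rng.shuffle(values)  # permutation restricted to this group
        for i, v in zip(indices, values):
            out[i][permute_col] = v
    return out

rows = [
    {"g": 0, "x": 1}, {"g": 0, "x": 2},
    {"g": 1, "x": 10}, {"g": 1, "x": 20},
]
shuffled = permute_within_groups(rows, "x", ["g"], random.Random(0))
# every row keeps an "x" value drawn from its own group
```

Scoring the trained model on `shuffled` versus the original rows, and averaging the drop over several such permutations, would then give the FLOFO importance of `x`.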