How to adapt the transformation function to account for variable sequence length? #12

Closed
grzechowiak opened this issue Apr 12, 2022 · 4 comments

@grzechowiak

I am trying to use TimeSHAP on my use case. As I understand it, the AReM example transforms the data with the df_to_numpy function so that a prediction is made for the last value of each sequence (see the screenshot below):

image
In the case of AReM tutorial data, the predictions are based on the whole sequence - all rows (rows ID 1-10) are being used for sequence ID 1 (light blue color) and the predictions are made for the Timestamp 10 (dark blue color; rows id 10). Later the light orange color is used (Row IDs 11-20) to predict a label marked as dark orange color (Row ID 20).

In my use case, the model predicts on a rolling-window basis and I need predictions for every row (not only one per sequence). See the screenshot and explanation below.
image
Let's say my rolling window is 6 and Row IDs 1-6 (light green) are used to predict row 7 (dark green), later Row IDs 2-7 (light grey) are being used to predict Row ID 8 (dark grey), etc. When a new Sequence starts, we repeat the process, so we take Row IDs 11-16 and predict Row ID 17, etc. For my use case, it's important to evaluate the predictions for every Row ID, not only for the whole sequence.

The problem which I am facing is that when I try to run the function get_avg_score_with_avg_event on the data defined as in the picture above I am getting the following error:
image

The way my data is transformed from 2D into 3D format is defined by the function below:
image

My question is whether it’s possible to make TimeSHAP work for the data which is transformed in a way described in my use case? When I use the transformation which is defined in your function df_to_numpy, I am not getting an error, however, it is not adapted to my use case.

@JoaoPBSousa JoaoPBSousa self-assigned this Apr 13, 2022
@JoaoPBSousa
Collaborator

Hi @grzechowiak,

Regarding your rolling window setup, TimeSHAP does not currently implement anything to accommodate that directly. The way to emulate your desired behavior is to divide a sequence into the respective sub-sequences you want to explain. In your example, the sequence with ID 1 needs to be divided into 4 sequences that are explained individually: Row IDs 1-7, 2-8, 3-9, and 4-10.
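
As a minimal sketch of that splitting step (not part of TimeSHAP; the helper name `rolling_subsequences` and the mock array layout are assumptions made for illustration), a window of 6 input rows plus the predicted row gives 7 rows per sub-sequence:

```python
import numpy as np

def rolling_subsequences(sequence, window, horizon=1):
    """Split one sequence of shape (n_rows, n_cols) into overlapping
    sub-sequences of `window + horizon` rows each (hypothetical helper)."""
    length = window + horizon
    n_sub = sequence.shape[0] - length + 1
    if n_sub < 1:
        return []
    return [sequence[start:start + length] for start in range(n_sub)]

# Mock sequence with ID 1: 10 rows, first column holds the Row ID,
# the remaining three columns are placeholder features.
row_ids = np.arange(1, 11).reshape(-1, 1)
features = np.zeros((10, 3))
sequence_1 = np.hstack([row_ids, features])

subsequences = rolling_subsequences(sequence_1, window=6)

# First and last Row ID of every sub-sequence.
spans = [(int(s[0, 0]), int(s[-1, 0])) for s in subsequences]
print(spans)  # [(1, 7), (2, 8), (3, 9), (4, 10)]
```

Each element of `subsequences` could then be treated as its own sequence to explain.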

Regarding the issue with get_avg_score_with_avg_event, this method calculates the average event and then passes it to the model repeatedly. The first input passed to the model is a single event with shape (1, 1, 25), followed by the same event repeated twice with shape (1, 2, 25), and so on. In your case, it seems the model expects a fixed sequence length of 50, i.e. shape (1, 50, 25), while get_avg_score_with_avg_event provides a single event.
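
The shape mismatch can be illustrated with a small NumPy sketch. This is not TimeSHAP's actual implementation; both model functions below are hypothetical stand-ins, and only the shapes (1, 1, 25), (1, 2, 25) and the fixed length of 50 come from the description above:

```python
import numpy as np

N_FEATURES = 25
FIXED_LEN = 50

def variable_length_model(x):
    # Hypothetical stand-in for a model that accepts any sequence length:
    # returns one score per sequence in the batch.
    return x.mean(axis=(1, 2)).reshape(-1, 1)

def fixed_length_model(x):
    # Hypothetical stand-in for a model built with a fixed input length.
    if x.shape[1] != FIXED_LEN:
        raise ValueError(
            f"expected sequence length {FIXED_LEN}, got {x.shape[1]}"
        )
    return variable_length_model(x)

# A placeholder "average event" with shape (1, 1, 25).
average_event = np.full((1, 1, N_FEATURES), 0.5)

# Repeat the same event along the time axis: lengths 1, 2, 3, ...
shapes = []
for length in range(1, 4):
    batch = np.repeat(average_event, length, axis=1)
    shapes.append(batch.shape)
    variable_length_model(batch)  # works for every length

print(shapes)  # [(1, 1, 25), (1, 2, 25), (1, 3, 25)]

# The fixed-length model rejects the very first input.
try:
    fixed_length_model(average_event)
    fixed_failed = False
except ValueError as err:
    fixed_failed = True
    print(err)
```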

Finally, from what I can understand, TimeSHAP can work with your described use case, as TimeSHAP explains each sequence individually and can work with any sequence length. To help you with the split_sequences method, I need more information about the sequences variable and the output of the method.

@grzechowiak
Author

grzechowiak commented Apr 15, 2022

Hi @JoaoPBSousa,

I have created some mock data as well as a simple LSTM model which is available on my github here: link.

There are two files: the first Python notebook, mock_example_my_use_case.ipynb, is based on the function required by my use case (described in the thread above), while the second Python notebook, mock_example_ts_use_case.ipynb, runs on the same data but uses a function similar to the one described in the AReM tutorial (this is for demonstration purposes, showing that the way the data is transformed causes an error in TimeSHAP).

How can I adapt the transformation function / TimeSHAP in order to make it work on our data and a rolling window setup?

@JoaoPBSousa
Collaborator

Hi @grzechowiak,

I looked at your repo and the only thing I could find is that the Timestamp column was being used as a feature to the model but not used on TimeSHAP. This means, when calculating the average event with only feature_1, feature_2, feature_3, and passing that to get_avg_score_with_avg_event, it threw an error as TimeSHAP was missing the Timestamp feature.

In order to fix this issue, there are two options depending on your use case:

  • In case the Timestamp is a feature for the model, add it to the variable model_features and it works;
  • In case the Timestamp is not supposed to be used for training, remove it from the training data. You can do this in your method split_sequences with the line seq_x, seq_y = sequences[i:end_ix, 1:-1], sequences[end_ix-1, -1].
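
For the second option, a minimal sketch of where that line would sit. Only the `seq_x, seq_y = ...` line is taken from the suggestion above; the rest of `split_sequences`, the `n_steps` parameter, and the column layout (Timestamp first, label last) are assumptions about the notebook:

```python
import numpy as np

def split_sequences(sequences, n_steps):
    """Rolling-window split that drops column 0 (assumed Timestamp) from
    the inputs and uses the last column (assumed label) as the target."""
    X, y = [], []
    for i in range(len(sequences)):
        end_ix = i + n_steps
        if end_ix > len(sequences):
            break
        # Columns 1:-1 are the features; column 0 (Timestamp) is skipped.
        seq_x, seq_y = sequences[i:end_ix, 1:-1], sequences[end_ix - 1, -1]
        X.append(seq_x)
        y.append(seq_y)
    return np.array(X), np.array(y)

# Mock data with assumed layout: [Timestamp, feature_1..3, label].
data = np.column_stack([
    np.arange(1, 11),               # Timestamp
    np.arange(30).reshape(10, 3),   # feature_1, feature_2, feature_3
    np.arange(10) % 2,              # label
])

X, y = split_sequences(data, n_steps=6)
print(X.shape, y.shape)  # (5, 6, 3) (5,)
```

With this layout the model sees three features per event, which matches a model_features list of feature_1, feature_2, and feature_3.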

Note: TimeSHAP (and SHAP) is designed to explain the difference between a baseline score and the score of the instance being explained. I noticed that all the (mock) sequence scores are really low (max 0.08477464 on the training sequences) and very close to the baseline (0.08404713). Explaining these sequences might produce strange event- and feature-level explanations, and throw an error in the pruning algorithm, which we have not yet addressed.
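
To make the size of that gap concrete, a quick arithmetic check using the two numbers quoted above (the variable names are illustrative only):

```python
baseline_score = 0.08404713       # baseline score quoted above
max_sequence_score = 0.08477464   # highest training-sequence score quoted above

# The quantity the explanations have to distribute across events and features.
gap = max_sequence_score - baseline_score
relative_gap = gap / baseline_score

print(f"{gap:.8f}")  # 0.00072751
print(f"{relative_gap:.2%}")
```

Even for the highest-scoring sequence, the amount to be attributed is well under one thousandth of a score unit.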

Hope this answer was helpful. If you have any further questions, don't hesitate to contact us.

@JoaoPBSousa
Collaborator

Closing this issue due to inactivity. If you have any further questions, feel free to re-open the issue or create a new one.
