Use categorical and dynamic features by default in DeepAR #144

Open
rshyamsundar opened this issue Jun 21, 2019 · 4 comments
Labels
enhancement New feature or request

Comments

@rshyamsundar
Contributor

Currently DeepAR does not use categorical and dynamic features by default, even though they are present in the dataset: the flags use_feat_dynamic_real and use_feat_static_cat are set to False by default in DeepAREstimator. This hurts our results, since people usually run methods with their default options or miss setting these flags explicitly.

There were a couple of arguments against using them by default, but there are better remedies for those than silently running DeepAR with incorrect options and returning less accurate results:

  • Issue with not setting cardinality while still using feat_static_cat: we can make the cardinality argument compulsory (or, ideally, derive it from the data)

  • If not all time series have the same features: there is no harm in failing with a suitable error when the data is not consistent.

Which case is the priority for us right now: running smoothly even if the data is not properly formatted, or running with the correct options and succeeding only when the data is consistent?
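For reference, this is roughly what a user currently has to write to make DeepAR use both feature types; a minimal sketch, assuming the 2019-era GluonTS import paths and the use_feat_dynamic_real, use_feat_static_cat and cardinality arguments mentioned above (freq, prediction_length and the trainer settings are placeholders):

from gluonts.model.deepar import DeepAREstimator
from gluonts.trainer import Trainer

# Both flags must be switched on explicitly; forgetting either one means the
# corresponding features in the dataset are silently ignored.
estimator = DeepAREstimator(
    freq="H",                    # placeholder frequency
    prediction_length=24,        # placeholder horizon
    use_feat_dynamic_real=True,  # False by default
    use_feat_static_cat=True,    # False by default
    cardinality=[4, 5, 6],       # required when static categorical features are used
    trainer=Trainer(epochs=10),
)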

@rshyamsundar rshyamsundar added the enhancement New feature or request label Jun 21, 2019
@mbohlkeschneider
Contributor

Which case is the priority for us right now: running smoothly even if the data is not properly formatted, or running with the correct options and succeeding only when the data is consistent?

In the case of GluonTS, I think we aspire to build a scientific library, so the algorithms should fail if there are issues in the data. That informs the user that something is not right. Otherwise you are left wondering why your results are not as good as you expect, especially if something is silently not used, discarded, or filtered. I think this behavior should be avoided throughout the code base.

@benidis
Contributor

benidis commented Jun 21, 2019

I started looking at this issue over the last two days; it is a combination of addressing the input format question and defining the correct behaviour of the transformations. I agree with Michael that we should not do things silently: if something is wrong we should throw an error instead of trying to filter it internally. However, this opens more questions:

  1. Should we check whether all the fields in a dataset are correct (probably while creating windows) and throw an error if not? This adds some complexity, since the check needs to be applied at each created window.
  2. What should we do with custom fields in a dataset, or with fields that are not used by the model, especially the ones that can break the code, e.g. "Dynamical Feature with Different Length will Throw Exception" #94 (note that the fix there is not global but only for DeepAR; any other estimator can fail with the same issue)?
  3. Setting DeepAR aside and looking at the bigger picture, what should be the behaviour of all estimators with regard to the input data? Should they always take into account (or at least have the option to take into account) a field that appears in the dataset, or should they use only prespecified fields regardless of the input data, as we have been doing up to now?

For the cardinality question, I think inferring it from the data in an efficient way would be ideal but is probably not possible. An informative error message would do the job, something like: "You are using categorical features but you have not set the cardinality hyperparameter correctly". For the flags in DeepAR I have exactly the same opinion: the default should be to use the features, since people usually do not know about these values or do not bother to change them.
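For illustration, a minimal sketch of such a check; this is not existing GluonTS code, and the argument names simply mirror the flags discussed above:

def check_cardinality(use_feat_static_cat: bool, cardinality) -> None:
    # Fail loudly instead of silently training without the categorical features.
    if use_feat_static_cat and not cardinality:
        raise ValueError(
            "You are using categorical features but you have not set the "
            "cardinality hyperparameter correctly."
        )
    if not use_feat_static_cat and cardinality:
        raise ValueError(
            "cardinality was provided but use_feat_static_cat is False; "
            "the categorical features would be silently ignored."
        )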

@lostella
Contributor

lostella commented Jun 21, 2019

I think the ideal solution would be to drive what is being used from the data (and therefore expected in the data) using schema-like structures like the following:

{
    'start': {},
    'target': {'shape': ()},
    'feat_dynamic_real': {'shape': (1,)},
    'feat_static_cat': {'shape': (3,), 'cardinality': [4, 5, 6]}
}

This could be used, among other things, to configure the transformation chain: the keys in such a dictionary tell you which fields are expected to be in the data. Using this schema-like dictionary, estimators can do many things:

  • They can assume a minimal schema {'start': {}, 'target': {'shape': ()}} unless a different one is specified; this would pretty much amount to the current behaviour, with the difference that the user would be able to specify everything about the data in one single object (instead of potentially 4 flags and 2 cardinalities)
  • Or, we could decide to infer such a schema from the training data as soon as training is triggered.
  • Given such a schema, one can use it to validate a DataEntry or a whole Dataset.
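As an illustration of the last point, a minimal validation sketch; the helpers are hypothetical, not existing GluonTS code, and a Dataset is treated simply as an iterable of dict entries:

import numpy as np

def validate_entry(entry: dict, schema: dict) -> None:
    # Every field declared in the schema must be present in the entry.
    for field in schema:
        if field not in entry:
            raise ValueError(f"Data entry is missing the field '{field}'.")
    # If cardinalities are declared, the categorical values must fall inside them.
    cat_spec = schema.get('feat_static_cat', {})
    if 'cardinality' in cat_spec:
        cats = np.asarray(entry['feat_static_cat'], dtype=int)
        for i, (value, card) in enumerate(zip(cats, cat_spec['cardinality'])):
            if not 0 <= value < card:
                raise ValueError(
                    f"feat_static_cat[{i}] = {value} is outside the declared "
                    f"cardinality {card}."
                )

def validate_dataset(dataset, schema: dict) -> None:
    for entry in dataset:
        validate_entry(entry, schema)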

Constructing such a schema from the training data would require a full pass through the dataset, not only looking at which fields are there, but also looking for the maximum of all categorical features (to get the cardinality of their domain). But this doesn't seem too bad to me.

There are some structures in the codebase that aim at something similar, I think (cf. MetaData). I'm working on a POC for this, I'll send it around when I'm satisfied with it :-)
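To make the inference route concrete, a rough sketch of the single pass described above; the function is hypothetical and only meant to illustrate the idea, with field names taken from the example schema:

import numpy as np

def infer_schema(dataset) -> dict:
    # One full pass over the data: record which optional fields appear and track
    # the maximum value of each static categorical feature.
    schema = {'start': {}, 'target': {'shape': ()}}
    max_cat = None
    for entry in dataset:
        if 'feat_dynamic_real' in entry:
            num_features = np.asarray(entry['feat_dynamic_real']).shape[0]
            schema['feat_dynamic_real'] = {'shape': (num_features,)}
        if 'feat_static_cat' in entry:
            cats = np.asarray(entry['feat_static_cat'], dtype=int)
            max_cat = cats if max_cat is None else np.maximum(max_cat, cats)
    if max_cat is not None:
        schema['feat_static_cat'] = {
            'shape': max_cat.shape,
            # cardinality = largest observed code + 1, assuming 0-based categories
            'cardinality': [int(m) + 1 for m in max_cat],
        }
    return schema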

@sujayramaiah

@lostella Can you please confirm whether you were able to complete the POC?
Your solution would make DeepAR much easier to use, provided we input the data in the correct format, without having to worry about setting multiple flags. It would also be good if we could log which dynamic and categorical features are being used by the model.
