
Working with data of many 0s and 1s and very few 2,3 and 4s #1153

Closed
Vidyapreethaam opened this issue Sep 26, 2019 · 8 comments

Comments

@Vidyapreethaam

I am trying to forecast daily sales for a product, and the demand values are mostly 0s and 1s. When I use Prophet, the forecasts come out as decimals. Is there any workaround for this while using Prophet?

@APramov

APramov commented Sep 27, 2019

@Vidyapreethaam, I don't see an easy way to do that. I think that the assumed data generating process is continuous, as the error term is assumed to be normally distributed.

I guess it might be possible to do something if you change the distribution of the error term to something that fits your DGP better (a Poisson whose lambda is modeled as a function of time, in the same manner as the mean is modeled in Prophet?), but frankly I am not sure how the whole structure would hold in your case, which seems to be a (zero-inflated?) count process. I'd be really curious to hear @bletham's take on this; maybe I am missing something. I don't have direct experience with such time series.

In the meantime, you might want to look at the literature and models for time series analysis of (zero-inflated?) count processes.
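For context, a zero-inflated Poisson can be sanity-checked with a quick method-of-moments fit. This is a hypothetical, library-free sketch (not Prophet code), where the estimators follow from mean = (1 − π)λ and var = mean · (1 + πλ):

```python
import numpy as np

# Simulate a zero-inflated Poisson: with probability pi, a structural
# zero; otherwise a Poisson(lam) draw. (Made-up parameters.)
rng = np.random.default_rng(0)
n, pi, lam = 20_000, 0.6, 2.0
structural_zero = rng.random(n) < pi
y = np.where(structural_zero, 0, rng.poisson(lam, n))

# Method-of-moments estimates:
#   mean m = (1 - pi) * lam,  var v = m * (1 + pi * lam)
#   => pi * lam = v / m - 1,  lam = m + v / m - 1
m, v = y.mean(), y.var()
lam_hat = m + v / m - 1
pi_hat = (v / m - 1) / lam_hat
```

With these sample sizes the estimates land close to the true (π, λ) = (0.6, 2.0); this is only a diagnostic for whether a zero-inflated count model is plausible for the data, not a forecasting model.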

@amw5g

amw5g commented Sep 27, 2019

I agree with @APramov that you're going to have a tough time. And perhaps a GAM, such as the one Prophet uses, isn't the right fit. I'd look into negative binomial or zero-inflated Poisson (ZIP) models, as suggested.

That said, if you want to persist, you could aggregate your daily numbers up to a coarser time granularity, like weekly or biweekly. Fit the Prophet model on that. Then redistribute the aggregated forecast counts back down to the daily level. That last step you'll need to do outside Prophet, based on your understanding of the domain.
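A minimal sketch of that aggregate-then-redistribute idea, using pandas only. The Prophet fit is stubbed out with a made-up weekly yhat, and redistributing by historical day-of-week shares is one plausible choice, not the only one:

```python
import numpy as np
import pandas as pd

# Hypothetical daily demand of mostly 0s and 1s.
rng = np.random.default_rng(42)
idx = pd.date_range("2019-01-01", periods=28 * 4, freq="D")
daily = pd.Series(rng.binomial(1, 0.3, len(idx)), index=idx)

# 1) Aggregate to weekly totals; this is what you would feed to
#    Prophet as a (ds, y) frame.
weekly = daily.resample("W").sum()

# 2) Stand-in for a weekly Prophet forecast: here we just reuse the
#    mean weekly total as "yhat" for one future week.
yhat_week = weekly.mean()

# 3) Redistribute the weekly forecast to days using historical
#    day-of-week shares (this step happens outside Prophet).
dow_share = daily.groupby(daily.index.dayofweek).sum()
dow_share = dow_share / dow_share.sum()
daily_forecast = yhat_week * dow_share  # indexed 0=Mon .. 6=Sun
```

The daily values sum back to the weekly forecast by construction; any domain knowledge (promotions, closures) would go into step 3.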

@bletham
Contributor

bletham commented Sep 29, 2019

Agreed, the Prophet model will be a bad fit for these data since it uses a Gaussian likelihood. Ultimately we want to support a negative binomial or Poisson likelihood, where basically the Prophet model would represent the latent arrival rate (#337), but it's proven a bit tricky and so you'll have to do something else for now.

I think @amw5g's idea of aggregating to learn seasonality is really interesting. For instance, if there is yearly seasonality, then aggregating to weekly data would allow you to capture it with Prophet, and then use that in combination with some other model as @amw5g suggests.

@trevor-pope

@Vidyapreethaam You could consider using something similar to Croston's Method, which is used to forecast series with intermittent demand (i.e., long but consistent runs of zeros between positive integer demands). Read more about it here
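For reference, Croston's method keeps two exponentially smoothed quantities, the nonzero demand size and the inter-demand interval, and forecasts their ratio. A minimal sketch (function and variable names are mine, not from any particular library):

```python
import numpy as np

def croston(y, alpha=0.1):
    """One-step-ahead Croston forecast for an intermittent series.

    y: sequence of non-negative demands (many zeros).
    Returns the flat forecast z / p, where z is the smoothed demand
    size and p the smoothed inter-demand interval.
    """
    y = np.asarray(y, dtype=float)
    nz = np.flatnonzero(y)
    if len(nz) == 0:
        return 0.0
    z = y[nz[0]]        # smoothed demand size
    p = nz[0] + 1.0     # smoothed interval (periods to first demand)
    q = 1               # periods since the last demand
    for t in range(nz[0] + 1, len(y)):
        if y[t] > 0:
            # Update both smoothers only when a demand occurs.
            z = alpha * y[t] + (1 - alpha) * z
            p = alpha * q + (1 - alpha) * p
            q = 1
        else:
            q += 1
    return z / p
```

On a series like `[0, 0, 1, 0, 0, 1, 0, 0, 1]` this yields a rate of about 1/3 per period, i.e. one unit every three periods, which is the intuition behind the method.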

@APramov

APramov commented Sep 30, 2019

> Agreed, the Prophet model will be a bad fit for these data since it uses a Gaussian likelihood. Ultimately we want to support a negative binomial or Poisson likelihood, where basically the Prophet model would represent the latent arrival rate (#337), but it's proven a bit tricky and so you'll have to do something else for now.
>
> I think @amw5g's idea of aggregating to learn seasonality is really interesting, for instance if there is yearly seasonality then aggregation to weekly data would allow you to capture yearly seasonality with Prophet, and then use that in combination with some other model like @amw5g suggests.

@bletham, regarding your first point here: I have been looking into using Prophet for count data and wanted to see if I can contribute. Is there an overview somewhere of what you have done so far and what's left to do on this issue?
Cheers

@Vidyapreethaam
Author

Thank you for all those inputs. A follow-up question: when I use Prophet at a weekly level, the yhat values come out as decimals. For the final forecast, should we round them, since we can't have demand forecasts as decimals?
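On the rounding question, two common post-processing options are rounding yhat to the nearest non-negative integer, or treating yhat as an arrival rate and sampling integer demand from a Poisson. A small illustration with made-up yhat values (neither option is Prophet functionality; both happen after the forecast):

```python
import numpy as np

# Suppose these are weekly yhat values from Prophet (made-up numbers).
yhat = np.array([0.4, 1.6, 2.3, 0.9])

# Simple option: round to the nearest non-negative integer.
rounded = np.clip(np.round(yhat), 0, None).astype(int)

# Alternative: treat yhat as an arrival rate and sample integer
# demand from a Poisson distribution (useful for scenario analysis).
rng = np.random.default_rng(0)
simulated = rng.poisson(np.clip(yhat, 0, None))
```

Rounding gives a single point forecast; Poisson sampling lets you generate many integer demand scenarios around the same rate.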

@Vidyapreethaam
Author

> @Vidyapreethaam You could consider using something similar to Croston's Method, which is used to forecast series with intermittent demand (i.e. long, but consistent, periods of zeroes following positive integers). Read more about it here

@bletham
Contributor

bletham commented Oct 1, 2019

@APramov the difficulty in changing the likelihood is that the model is specified in two places in the code: first in the Stan code, and then again in the R/Py code. That is because model fitting is done in Stan, but predictions are done purely in R/Py. So adding a new likelihood mode would require implementing the new likelihood model three times: in each of Stan, R, and Py. And the same goes for other customizations we'd like to make, like tweaking the trend model.

So the plan for handling this and other new model modes is to move model predictions into Stan, so that a change to the model need be implemented only once and we can more easily accommodate a larger number of modes. The typical workflow in Stan is to do predictions at the time of model fitting, which seemed too limiting for us and is not the typical API for software of this sort. Stan did not support making predictions from a previously fitted model until 2.18, which added the standalone generated quantities feature. However, as of the last time I looked into it (early summer), that feature had not yet propagated to rstan/pystan.

So that's the core of the challenge: making predictions from a previously fitted Stan model, despite this not being supported in rstan or pystan. We developed a trick for doing this in #865, where basically at predict time we redo the Stan inference but initialize the optimizer from the fitted values and do only 1 gradient step. This gets Stan to execute the Generated Quantities block, which is where we put the predict code. We actually got this working for producing yhat and its _lower and _upper bounds, but we found that it was very slow for producing posterior samples (needed for the predictive_samples function). The issue seemed to be in data transfer from CmdStan into R. But the person working on it wasn't able to dig more deeply into it, and I thought it best to wait a little and see if the standalone generated quantities feature would make it down to rstan/pystan before investing a lot more effort into this hack.

Standalone GQ did make it into rstan in the latest release, mid-summer. So the next step to getting this working is to try moving predictions into GQ as in #865, but then use the standalone GQ functionality to execute it. I plan to try this out over the next 6 weeks or so and see if I can get it working this time with the new rstan. If it does, then adding the new likelihood will be fairly straightforward. If not, then we'll have to decide on a new path, because a discrete likelihood is definitely the next major feature on the todo list.

If you are interested in working on this feature, you could take a look at #865; like I said, the first step is just to adapt that to use the new standalone GQ in rstan 2.19. I expect it to be a fairly substantial effort, but we could discuss it more on #501 if you want to try it out. In any case, I'll be working on it too, and once it's working (or not) I'll comment on #337; if you wanted to just implement the new likelihood model, that'd be a great contribution too.
