monthly data with seasonality issue #1110

Closed
germayneng opened this issue Aug 13, 2019 · 9 comments

@germayneng

germayneng commented Aug 13, 2019

Referencing #823

I am currently facing the same issue. Referencing the above, it makes sense that for monthly forecasting with yearly seasonality:

  • the yearly component repeats every 4 years (48 data points) instead of every 12

To try to combat this, the approach I went with was to:

  1. treat my monthly data like daily data, i.e. just remap the ds column onto a daily time index
  2. disable all built-in seasonalities
  3. add my own custom seasonality with period = 12 and 10 Fourier components

Then I continue to forecast as usual and change the ds column in the forecast dataframe back to monthly. A rough sketch of this setup is shown below.
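
For concreteness, here is a minimal sketch of that setup. The file name, column names, and the 12-month horizon are placeholders, and the import uses the package name at the time (fbprophet; it is prophet in newer releases).

import pandas as pd
from fbprophet import Prophet

# hypothetical monthly data with columns ds (month start) and y
df = pd.read_csv('monthly.csv', parse_dates=['ds'])

# 1. remap the monthly timestamps onto an evenly spaced 'daily' index
df['ds'] = pd.date_range(start=df['ds'].min(), periods=len(df), freq='D')

# 2. disable all built-in seasonalities
m = Prophet(yearly_seasonality=False, weekly_seasonality=False, daily_seasonality=False)

# 3. custom seasonality: one cycle every 12 'days' (i.e. 12 months), 10 Fourier components
m.add_seasonality(name='yearly_in_months', period=12, fourier_order=10)
m.fit(df)

# forecast 12 months ahead (12 'days' on the remapped axis), then map ds back to real months
future = m.make_future_dataframe(periods=12, freq='D')
forecast = m.predict(future)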

So this approach results in the custom column having a cycle of 12, i.e. it repeats every 12 months. But comparing them with walk-forward cross validation, this approach seems to be slightly worse off than the default forecast (with the yearly seasonality of period 365.25).

So it is apparent that the two methods are slightly different (even though we are only considering the trend and trying to capture the yearly seasonal component). The first approach (default) results in a 4-year seasonal pattern, while the second (monthly as daily) captures a recurring yearly pattern. The issue I had with the first, default approach (due to the leap year) looks like this:

This is the seasonality component from the yearly column. Notice that at the end of this year the fall in units is not as large as in the previous year (which is due to the leap year issue).
[image: yearly seasonality component from the default model]

And the second method (as described above) results in this: the seasonality with period = 12 and Fourier order 10 on my 'daily' (actually monthly) data. The x-axis is daily, but it is the same monthly data.
[image: period-12 custom seasonality component on the remapped daily index]

Cross-validation-wise, though, the first (default) approach seems to be slightly better. But I would prefer the second method, because we are expecting Dec 2019 to have a fall in units, and with the yearly Fourier terms that fall is 'damped'.

Is there any advice you can give regarding this issue? Is the daily method correct, and will it introduce any issues?

edit: I tested this method on various data we have. For some, the daily method is better, but in terms of validation error they are not too far apart.

edit 2: Can I ask whether the yearly seasonality is working as intended for monthly data (since the 365.25 parameter is meant to capture annual seasonality from a daily-data perspective)?

@bletham
Contributor

bletham commented Aug 14, 2019

These approaches will be almost the same thing (if months were evenly spaced out and there were no leap years, they'd be totally identical).

Seasonalities are specified in a unit of "days", but internally in the model there is nothing special about "days". Actually, internally time is converted to be a float between 0 and 1, and all of the model fitting and predicting happens in this transformed unit. This means that changing the data frequency by any constant will not in any way change the forecast. Here it does change the forecast slightly because the frequency was not changed by a constant since some months have different numbers of days, whereas after your transformation they are all evenly spaced.
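
A simplified sketch of that rescaling (this is not the actual Prophet code, just an illustration of why only the relative spacing of ds matters):

import pandas as pd

def scaled_time(ds):
    # history start maps to 0 and history end maps to 1, roughly what Prophet does internally
    start, end = ds.min(), ds.max()
    return (ds - start) / (end - start)

# true month starts are unevenly spaced, while the remapped 'daily' index is evenly spaced,
# so the two give slightly different scaled times; that is the only difference between the fits
month_starts = pd.Series(pd.date_range('2016-06-01', periods=36, freq='MS'))
daily_remap = pd.Series(pd.date_range('2016-06-01', periods=36, freq='D'))
print(scaled_time(month_starts).head())
print(scaled_time(daily_remap).head())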

Is there a month missing from the data? I'm trying to figure out why there would be such a large spike, even when you've converted it to daily seasonality.

Perhaps a better option for learning yearly seasonality with monthly data would be to just add a binary extra regressor for each month. That is, disable all seasonality and add a binary regressor "is_jan" that is 1 if the month is Jan, 0 otherwise. And so on for each month. No need to scale data dates. What does that look like?
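
A rough sketch of that setup, assuming monthly data in df with ds/y columns (the is_01 ... is_12 regressor names and the 12-month horizon are placeholder choices):

import pandas as pd
from fbprophet import Prophet

df = pd.read_csv('monthly.csv', parse_dates=['ds'])  # hypothetical monthly data

m = Prophet(yearly_seasonality=False, weekly_seasonality=False, daily_seasonality=False)
for month in range(1, 13):
    name = f'is_{month:02d}'  # is_01 for Jan, is_02 for Feb, ...
    df[name] = (df['ds'].dt.month == month).astype(int)
    m.add_regressor(name)
m.fit(df)

# the same binary columns must be added to the future dataframe before predicting
future = m.make_future_dataframe(periods=12, freq='MS')
for month in range(1, 13):
    future[f'is_{month:02d}'] = (future['ds'].dt.month == month).astype(int)
forecast = m.predict(future)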

@germayneng
Author

germayneng commented Aug 15, 2019

Thanks for the thorough explanations, @bletham. Let me elaborate on some of the points you mentioned and provide more experiment results.

These approaches will be almost the same thing (if months were evenly spaced out and there were no leap years, they'd be totally identical).

Seasonalities are specified in a unit of "days", but internally in the model there is nothing special about "days". Actually, internally time is converted to be a float between 0 and 1, and all of the model fitting and predicting happens in this transformed unit. This means that changing the data frequency by any constant will not in any way change the forecast. Here it does change the forecast slightly because the frequency was not changed by a constant since some months have different numbers of days, whereas after your transformation they are all evenly spaced.

A question that I have: from a daily-data perspective, yearly seasonality captures patterns at a yearly frequency, so 365.25 observations per cycle makes sense.
However, for monthly data I believe the correct way of looking at it is 12 observations per cycle (freq = 12), and hence the seasonality we aim to capture should repeat every 12 observations. But the current (default) implementation results in a different frequency, as you mentioned. From a monthly-data perspective, the number of days in a month should not matter, right?



Is there a month missing from the data? I'm trying to figure out why there would be such a large spike, even when you've converted it to daily seasonality.

No missing data. The data used for training is at a monthly level, from 2016-06 to 2019-05. Not sure if this matters, but for 'ds' I have used 01 as the day, so the data goes [2016-06-01, 2016-07-01, ..., 2019-05-01] (I understand there is a leap year effect, so the spacing from 2016-06-01 to 2017-06-01 is not the same every year).

To add on: on the contrary, we are expecting a large spike around Dec. The problem is that the default yearly seasonality does not return a consistent fall in units in Dec, especially for 2019, which prompted this investigation and made me realize that the yearly seasonality is fluctuating in the Dec part. The intuition we want is that Dec should always have a sharp decline in units, but 2019 seems to show a decline that is not as large as expected.

Also, it might be a typo on your end, but for the daily method I actually used a custom seasonality with period 12, so it shouldn't be a 'daily seasonality'.



Perhaps a better option for learning yearly seasonality with monthly data would be to just add a binary extra regressor for each month. That is, disable all seasonality and add a binary regressor "is_jan" that is 1 if the month is Jan, 0 otherwise. And so on for each month. No need to scale data dates. What does that look like?

So here is the interesting finding: disabling the yearly seasonality and instead adding dummy regressors for the months results in a forecast similar to the 'daily method'. Both methods give a consistent seasonal pattern (i.e. the dip in Dec, for example, is the same every year), as opposed to default Prophet with yearly seasonality (where the dip in Dec differs from year to year but is the same every 4 years).

For illustration, you can see the forecast given by:

  1. the daily method - again, this just remaps ds onto a daily scale, but it is monthly data with a period-12 seasonality

[image: forecast from the daily-remapping method]

  2. the regressor method, with dummy regressors to capture yearly seasonality

[image: forecast from the monthly dummy-regressor method]

and, for comparison, default Prophet with yearly seasonality (period = 365.25):

[image: forecast from default Prophet with yearly seasonality]

@bletham
Contributor

bletham commented Aug 17, 2019

365.25 isn't the number of observations per cycle, it's just the length in real-time of a cycle. So, yearly seasonality fits a periodic function with a period of 365.25 days. That actual function is modeled using a Fourier series (https://en.wikipedia.org/wiki/Fourier_series) and so is continuous-time. That continuous, 365.25-period function can be fit with any number of observations, whether daily or monthly (though of course more data will allow for more reliable fitting, and monthly observations will leave many parts of the function not-pinned-down, as described in https://facebook.github.io/prophet/docs/non-daily_data.html#monthly-data ).
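
Concretely, the seasonal component is the standard Fourier series (N is the fourier_order):

s(t) = \sum_{n=1}^{N} \left[ a_n \cos\left( \frac{2 \pi n t}{P} \right) + b_n \sin\left( \frac{2 \pi n t}{P} \right) \right], \qquad P = 365.25 \text{ days}

which is defined for every real t, not just the observed dates.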

I think what is happening here is that it is clearly a very sharp spike, and since the location of Dec 1 in the 365.25-day cycle varies from year to year (it falls back by 0.25 days every year until a leap year), that 0.25-day difference from year to year is actually producing noticeable changes. I suspect this is due to a combination of a very sharp seasonal effect and the monthly data, which make it hard for the model to identify the continuous-time yearly seasonality. This is the first time I've seen a noticeable effect from leap years in the yearly seasonality. I would definitely recommend the extra regressor approach here, and perhaps we should add this to the documentation since it may typically be a better approach for monthly data.
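
A quick way to see that drift, using plain date arithmetic with pandas (the origin date is arbitrary):

import pandas as pd

# position of Dec 1 within a 365.25-day cycle, measured from a fixed origin
origin = pd.Timestamp('2016-01-01')
for year in range(2016, 2020):
    days = (pd.Timestamp(f'{year}-12-01') - origin) / pd.Timedelta(days=1)
    print(year, round(days % 365.25, 2))  # 335.0, 334.75, 334.5, 334.25: falls back 0.25 days/year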

@bletham bletham closed this as completed Sep 28, 2019
@sammo

sammo commented Oct 1, 2019

I'd think that keeping yearly_seasonality=True and adding another custom seasonality with period=1461 would do a good job with this data. Have you considered that, @germayneng?
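
For reference, 1461 days is 4 × 365.25, i.e. one full leap-year cycle, so the setup would look roughly like this (the seasonality name and Fourier order below are placeholder choices):

from fbprophet import Prophet

# keep the built-in yearly seasonality and add a 4-year (1461-day) cycle on top
m = Prophet(yearly_seasonality=True, weekly_seasonality=False, daily_seasonality=False)
m.add_seasonality(name='four_year', period=1461, fourier_order=5)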

@germayneng
Author

germayneng commented Oct 2, 2019

Interesting. Not really. I think the two solutions above are more intuitive: either model the data directly as if it were daily, or just add seasonal dummies. In fact, theoretically, seasonal dummies achieve the same thing as Fourier regressors, which are mainly beneficial when you are working with higher-frequency data.

@sammo

sammo commented Oct 2, 2019

I agree that the methods you suggested are more intuitive to you, because you thought of them in the first place :)
The time-continuity of the Fourier series Ben mentioned is tricky though, given the falling back of 0.25 days each year that he explained. The dummy variables are surely a good idea. But if you get to try the period-1461 custom seasonality, I'd like to know how it does. I don't have the luxury of working with such a clean dataset...

@germayneng
Author

I apologize, since I am not able to understand the purpose of the 1461-day period. My guess is that by setting the Fourier period to 1461 we are trying to add an additional seasonality effect every 4 years? Is the purpose to compensate for the difference?

@germayneng
Author

germayneng commented Oct 2, 2019

Here are the results anyway. I did what was suggested: adding yearly seasonality plus an additional Fourier seasonality with period 1461.
I plotted the two components separately, and the last plot combines them.

[image: yearly seasonality component]

[image: period-1461 seasonality component]

[image: combined seasonality components]

In this case, the leap year effect is not eliminated (which is what I wanted to do in the first place, since I want a consistent seasonality effect throughout my forecast instead of a leap-year seasonality).

@sammo

sammo commented Oct 4, 2019

Hmm, I see. Thank you a bunch for sharing the results. I was wondering if adding a seasonality component for 4 years would handle the difference, but it clearly doesn't. I went over #825, which you referenced in your initial comment, and dug into the code, and now I get why it doesn't.
I initially misread Ben's comment and thought the time axis spreads the leap-year effect over the other three years by using 365.25 as the period, but going over the seasonality code, it doesn't...
So the number of seconds in each of the first four years of the data would be:

import pandas as pd

year = 2013
for i in range(4):
    # seconds between consecutive Jan 1 dates (pd.Timestamp instead of the deprecated pd.datetime)
    t = (pd.Timestamp(year + 1 + i, 1, 1) - pd.Timestamp(year + i, 1, 1)).total_seconds()
    print(f"Time delta in seconds from Jan 1 {year + i} to {year + i + 1}: {t} seconds.")

Time delta in seconds from Jan 1 2013 to 2014: 31536000.0 seconds.
Time delta in seconds from Jan 1 2014 to 2015: 31536000.0 seconds.
Time delta in seconds from Jan 1 2015 to 2016: 31536000.0 seconds.
Time delta in seconds from Jan 1 2016 to 2017: 31622400.0 seconds.

This means the 2016-2017 difference is what lets the yearly effect catch up, in the yearly column, with the 365.25-day period...
Agreed, then, that the external regressors for each month make total sense... Thanks again for trying it and posting the graphs!
