Definition of 'd': is d 0 <= d < 1 or 0 < d <= 1? #222

Open
pschil opened this issue Sep 6, 2023 · 0 comments
Labels
enhancement New feature or request

Comments

@pschil
Collaborator

pschil commented Sep 6, 2023

Background

d in the pnbd dyncov model is the number of periods until the end of the covariate interval containing a transaction ('ceiling_date' on the time unit). Its definition remains an open issue because the math builds on continuous time, which leaves open how transactions on a covariate interval boundary should be treated.

Theory assumes continuous time, which makes it a matter of definition whether a time point on the boundary of a covariate interval counts towards the first or the next covariate interval. Hence, d can be defined either as 0 <= d < 1 or as 0 < d <= 1. What we overlooked for a long time is that the correct definition of d depends on whether a transaction on the interval boundary counts towards that interval or not. This should conclusively determine the definition of d.

Due to the discrete temporal resolution (there is only finite precision on computers), all covariate intervals are closed. A transaction can hence only count towards one covariate interval but never to two covariate intervals at the same time. A transaction occurring on what is defined as the lower or upper boundary of this closed interval counts towards the covariate interval it is in. In other words, interval boundaries never "overlap" and a transaction always belongs to a single interval.
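
To make the "exactly one interval" property concrete, here is a minimal sketch in Python. It assumes day resolution and yearly covariate intervals; the names `INTERVALS` and `containing_intervals` are made up for illustration and are not part of the package.

```python
from datetime import date

# Hypothetical yearly covariate intervals, closed on both ends (day resolution).
INTERVALS = {
    1999: (date(1999, 1, 1), date(1999, 12, 31)),
    2000: (date(2000, 1, 1), date(2000, 12, 31)),
}

def containing_intervals(t: date) -> list[int]:
    """Return the labels of all closed intervals [lower, upper] that contain t."""
    return [label for label, (lower, upper) in INTERVALS.items()
            if lower <= t <= upper]

# Transactions on a lower or an upper boundary still fall into a single interval.
for t in (date(1999, 1, 1), date(1999, 12, 31), date(2000, 1, 1)):
    print(t.isoformat(), containing_intervals(t))
```

Because the upper boundary of one interval and the lower boundary of the next are distinct days, no date can match two intervals.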

Practical relevance

Fitting the model on the example data provided in the package, using the code on the man page of SetDynamicCovariates, once with change_on_boundary=True and once with change_on_boundary=False, shows:

  • Considerably different coefficients estimated, but mostly for the dropout covariates (and much more for negative values)
  • Nearly double the runtime and not both KKT conditions fulfilled for change_on_boundary=False, which hints at a convergence issue
  • Slightly different final LL values
  • Results are reproducible, hence parameter variations are not due to model identification issues

Results

change_on_boundary=True: 
              r           alpha               s            beta  life.Marketing     life.Gender    life.Channel trans.Marketing    trans.Gender   trans.Channel 
           1.65           40.05            0.36            5.58           -0.16            0.76           -1.76            0.45            1.26            0.34

change_on_boundary=False: 
              r           alpha               s            beta  life.Marketing     life.Gender    life.Channel trans.Marketing    trans.Gender   trans.Channel 
           1.62           34.66            0.40            7.46           -0.93            0.71           -1.28            0.43            1.13            0.33

change_on_boundary=True: 
value fevals gevals niter convcode kkt1 kkt2 xtime
4926.152   2203     NA    NA        0 TRUE TRUE 42.75

change_on_boundary=False: 
value fevals gevals niter convcode  kkt1 kkt2  xtime
4931.944   2737     NA    NA        0 FALSE TRUE 82.653

Example

Using years rather than weeks because start/end of weeks is ill-defined and itself often a source of confusion.

Given the covariate intervals for 1999 and 2000, which span [01-01-1999, 31-12-1999] and [01-01-2000, 31-12-2000], what is the d associated with a transaction on

  • 01-01-1999 (lower boundary): either d=1 or d=0
  • 31-12-1999 (upper boundary): d=1/365, always
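
The two candidate values can be reproduced with a small Python sketch. It assumes d is measured in years at day resolution, as the distance to the next 1 January divided by the length of the year. The function `d_in_years` and its `change_on_boundary` argument are illustrative only and are not the package's implementation.

```python
from datetime import date
from fractions import Fraction

def d_in_years(t: date, change_on_boundary: bool) -> Fraction:
    """Fraction of a year from t to the end of its covariate interval.

    Assumption: the interval ends at the next 1 January. On the lower
    boundary (1 January itself) the two conventions differ.
    """
    lower = date(t.year, 1, 1)
    next_lower = date(t.year + 1, 1, 1)
    if t == lower and not change_on_boundary:
        return Fraction(0)  # boundary is its own ceiling: 0 <= d < 1
    # otherwise measure up to the next boundary: 0 < d <= 1
    return Fraction((next_lower - t).days, (next_lower - lower).days)

for flag in (True, False):
    print(flag,
          d_in_years(date(1999, 1, 1), flag),    # lower boundary
          d_in_years(date(1999, 12, 31), flag))  # upper boundary
```

Under these assumptions only the lower boundary changes between the two conventions (1 versus 0), while the upper boundary gives 1/365 either way.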

Resolving this issue

What we know that could help nail this down

  • all intervals are closed
  • transactions on the upper or lower boundary are both always counted towards the interval they are in
  • as we move backwards from 31-12-1999, d increases and then either jumps to 0 at 01-01-1999 or "continuously" increases to 1
  • a parameter recovery study (better recovery, better sum LL value) could help

Note that the question revolves around the interval's lower boundary: because the upper boundary does not overlap with the lower boundary of the next interval, the distance from the upper boundary to the start of the next interval is always > 0 (31-12-1999 in the example). The question is whether d should be 0 or 1 on the lower boundary.

Appendix: change_on_boundary

Adding to the confusion, the time unit boundary in lubridate, including in ceiling_date(), refers to what is the lower boundary of the covariate interval. This is because the intervals are defined to start from and include lubridate's time unit boundary (01-01-1999 in the example).
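
The boundary behaviour described here can be imitated in a few lines of Python. This is only an emulation for yearly units at day resolution, written from the description in this issue; `ceiling_year` is a hypothetical helper, not lubridate's code.

```python
from datetime import date

def ceiling_year(t: date, change_on_boundary: bool) -> date:
    """Round t up to a 1 January (the lower boundary of a yearly interval).

    A date already on the boundary is either left unchanged or moved to
    the next boundary, depending on change_on_boundary.
    """
    on_boundary = (t.month, t.day) == (1, 1)
    if on_boundary and not change_on_boundary:
        return t
    return date(t.year + 1, 1, 1)

print(ceiling_year(date(1999, 1, 1), change_on_boundary=True))
print(ceiling_year(date(1999, 1, 1), change_on_boundary=False))
print(ceiling_year(date(1999, 12, 31), change_on_boundary=False))
```

The only input for which the flag matters is the lower boundary itself, which matches the observation that the open question concerns 01-01-1999 and not 31-12-1999.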

@pschil pschil mentioned this issue Sep 6, 2023
@mmeierer mmeierer added the enhancement New feature or request label Sep 30, 2023