Definition of 'd': is d `0 <= d < 1` or `0 < d <= 1`? #222

pschil · 2023-09-06T21:26:40Z

Background

d in the pnbd dyncov model is the number of periods until the end of the covariate interval containing a transaction ('ceiling_date' on the time unit). Its definition remains an ongoing issue because the math builds on continuous time, which leaves open how transactions on a covariate interval boundary should be treated.

Theory assumes continuous time, which makes it a matter of definition whether a time point on the boundary of a covariate interval belongs to the first or to the next covariate interval. Hence, d can be either 0 <= d < 1 or 0 < d <= 1. What we overlooked for a long time is that the correct definition of d depends on whether a transaction on the interval boundary belongs to that interval or not. This should conclusively determine the definition of d.

Due to the discrete temporal resolution (there is only finite precision on computers), all covariate intervals are closed. A transaction can hence only count towards one covariate interval but never to two covariate intervals at the same time. A transaction occurring on what is defined as the lower or upper boundary of this closed interval counts towards the covariate interval it is in. In other words, interval boundaries never "overlap" and a transaction always belongs to a single interval.
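The "closed and non-overlapping" property can be checked with a minimal sketch. It is written in Python for illustration only and is not the package's implementation; the interval list and the helper name `containing_intervals` are assumptions, and daily resolution is assumed.

```python
from datetime import date

# Closed, non-overlapping yearly covariate intervals at daily resolution.
# The concrete years are illustrative assumptions.
INTERVALS = [
    (date(1999, 1, 1), date(1999, 12, 31)),
    (date(2000, 1, 1), date(2000, 12, 31)),
]

def containing_intervals(t):
    """Return every interval [lower, upper] containing t, both ends inclusive."""
    return [(lo, hi) for lo, hi in INTERVALS if lo <= t <= hi]

# Transactions on a lower or an upper boundary still match exactly one interval.
for t in (date(1999, 1, 1), date(1999, 12, 31), date(2000, 1, 1)):
    matches = containing_intervals(t)
    assert len(matches) == 1
    print(t, "->", matches[0])
```

Because the upper boundary of one interval and the lower boundary of the next are distinct days, both can be inclusive without any transaction being counted twice.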

Practical relevance

Running the code from the man page of SetDynamicCovariates on the example data provided in the package, with change_on_boundary=True/False:

Considerably different coefficient estimates, mostly for the dropout covariates (and much more so for negative values)
Nearly double the runtime and only one of the two KKT conditions fulfilled for change_on_boundary=False, which hints at a convergence issue
Slightly different final LL values
Results are reproducible, hence the parameter differences are not due to model identification issues

Results

change_on_boundary=True: 
              r           alpha               s            beta  life.Marketing     life.Gender    life.Channel trans.Marketing    trans.Gender   trans.Channel 
           1.65           40.05            0.36            5.58           -0.16            0.76           -1.76            0.45            1.26            0.34

change_on_boundary=False: 
              r           alpha               s            beta  life.Marketing     life.Gender    life.Channel trans.Marketing    trans.Gender   trans.Channel 
           1.62           34.66            0.40            7.46           -0.93            0.71           -1.28            0.43            1.13            0.33

change_on_boundary=True: 
value fevals gevals niter convcode kkt1 kkt2 xtime
4926.152   2203     NA    NA        0 TRUE TRUE 42.75

change_on_boundary=False: 
value fevals gevals niter convcode  kkt1 kkt2  xtime
4931.944   2737     NA    NA        0 FALSE TRUE 82.653

Example

Using years rather than weeks because start/end of weeks is ill-defined and itself often a source of confusion.

Given the covariate intervals for 1999 and 2000, which span [01-01-1999, 31-12-1999] and [01-01-2000, 31-12-2000], what is d for a transaction on

01-01-1999 (lower boundary): either d=1 or d=0
31-12-1999 (upper boundary): d=1/365, always
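The two candidate values for the lower boundary can be reproduced with a small sketch. It is Python for illustration only; `ceiling_year` is a hypothetical stand-in that mimics rounding up to the next year boundary with a `change_on_boundary` switch, and is not lubridate's or the package's code. Daily resolution and a period length equal to the number of days in the year are assumed.

```python
from datetime import date

def ceiling_year(t, change_on_boundary):
    """Hypothetical stand-in: round a date up to a year boundary (1 January).

    If t already lies on the boundary, it is moved to the next boundary only
    when change_on_boundary is True; otherwise it is returned unchanged.
    """
    on_boundary = (t.month == 1 and t.day == 1)
    if on_boundary and not change_on_boundary:
        return t
    return date(t.year + 1, 1, 1)

def d_in_years(t, change_on_boundary):
    """Distance from t to the rounded-up boundary, as a fraction of t's year."""
    year_length = (date(t.year + 1, 1, 1) - date(t.year, 1, 1)).days
    return (ceiling_year(t, change_on_boundary) - t).days / year_length

for t in (date(1999, 1, 1), date(1999, 12, 31)):
    for flag in (True, False):
        print(t, "change_on_boundary =", flag, "d =", d_in_years(t, flag))
```

Under these assumptions only the lower boundary reacts to the switch: it yields d=1 with `change_on_boundary=True` and d=0 with `change_on_boundary=False`, while the upper boundary gives 1/365 either way.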

Resolving this issue

What we know that could help nail this down

all intervals are closed
transactions on the upper or lower boundary are both always counted towards the interval they are in
as we move backwards from 31-12-1999, d increases and then either jumps to 0 at 01-01-1999 or "continuously" increases to 1
a parameter recovery study (better recovery, better sum LL value) could help

What should be noted is that the question revolves around the lower boundary of the interval: because the upper boundary does not overlap with the lower boundary of the next interval, the distance from the upper boundary to the start of the next interval is always >0 (d=1/365 for 31-12-1999 in the example). The question is whether d should be d=0 or d=1 for the lower boundary.
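To illustrate that the two definitions disagree on the lower boundary only, the following sketch sweeps over every day of 1999. It is Python for illustration under assumed daily resolution; the function names `d_convention_a` and `d_convention_b` are hypothetical labels for the two candidate definitions.

```python
from datetime import date, timedelta

YEAR_START = date(1999, 1, 1)
NEXT_YEAR_START = date(2000, 1, 1)
N_DAYS = (NEXT_YEAR_START - YEAR_START).days  # length of the 1999 interval in days

def d_convention_a(t):
    """0 < d <= 1: the lower boundary is a full period away from the interval end."""
    return (NEXT_YEAR_START - t).days / N_DAYS

def d_convention_b(t):
    """0 <= d < 1: the lower boundary itself is assigned d = 0."""
    if t == YEAR_START:
        return 0.0
    return (NEXT_YEAR_START - t).days / N_DAYS

# Every day of the closed interval [01-01-1999, 31-12-1999].
days = [YEAR_START + timedelta(days=i) for i in range(N_DAYS)]

# Days on which the two conventions assign a different d.
differing = [t for t in days if d_convention_a(t) != d_convention_b(t)]
print(differing)
```

In this sketch the list of differing days contains only 01-01-1999, so any difference in the estimates would have to come from transactions that fall exactly on a lower boundary.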

Appendix: change_on_boundary

Adding to the confusion, the time unit boundary in lubridate, including in ceiling_date(), is what we call the lower boundary of the covariate interval. This is because the intervals are defined to start from and include lubridate's time unit boundary (01-01-1999 in the example).

pschil mentioned this issue Sep 6, 2023

PNBD Dyncov Rcpp #205

Merged

mmeierer added the enhancement New feature or request label Sep 30, 2023
