Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PNBD Dyncov Rcpp #205

Merged
merged 79 commits into from
Sep 6, 2023
Merged

PNBD Dyncov Rcpp #205

merged 79 commits into from
Sep 6, 2023

Conversation

pschil
Copy link
Collaborator

@pschil pschil commented Apr 4, 2022

Definitions

  • d_: The number of periods from a timepoint until the end of the covariate period it is contained in (ceiling_date). From theory which assumes continuous time, it is a matter of definition whether a timepoint that lies on the covariate boundary counts to the first or the next covariate interval and d hence is either in (0, 1] or in [0, 1). It is currently implemented such that d is in (0, 1].
    • d_omega: d of first transaction (coming alive)
    • d_1: In a walk, it refers to d of the start of the walk (first transaction)
    • d_T: despite its name, dT refers to d of t.x rather than T and differs by customer. It is the d1 of the aux walks (1 because it refers to the first transaction of the walk). The d in DECT/ABCD are actual d_T which describe d of T.
  • tjk: Number of periods between the transactions defining a walk. Its range is >=0 and equals 0 only for the aux walks of customers who have their last transaction on the estimation end (t.x=T). Other transactions cannot be on the same time point because they would be merged to a single transaction.

Walks

Walk Definition

Aux walks (life and trans):

Covariates [t.x, T] = Covariates from last transaction to estimation end. A single walk per customer, of the same length for life and trans. Guaranteed to exist and such that they always contain the covariates where t.x and T are in. Created as if there is an artificial transaction at T.

Real life walk

Covariates [0, t.x] = Covariates from the first until last transaction. Only ever used in Di(). It may not overlap with the aux walk to avoid double counting in Di() and is therefore defined as all covariates before the aux walk (the "difference" of the aux walk [t.x, T] and the walk [0, t.x]). As a consequence, there is no real lifetime walk for customers having t.x in their first covariate period (mostly zero repeaters).

Real trans walk

Covariates [t_j, t_k] = Covariates from one repeat transaction to the next. There is no walk up to the first transaction and hence there are no walks for zero-repeaters. The number of walks varies per customer.
A covariate may appear twice (in two subsequent walks) if a transaction is on the covariate boundary because the transaction is used to define two walks: Transaction j is upper and lower boundary of subsequent walks, i.e. [tp_i, tp_j] and [tp_j, tp_k].

Walk Data

Walk data is stored in 4 separate data tables, in long format to facilitate passing it as a matrix to Rcpp

  • data.walks.life.aux: Single walk per customer.
  • data.walks.trans.aux: Single walk per customer.
  • data.walks.life.real: Single walk per customer, if exists.
  • data.walks.trans.real: As many walks as repeat transactions (none for zero-repeaters)

The tables are sorted by customer id, covariate date. Hence it is guaranteed that data belonging to the same walk comes in subsequent rows.

Columns

  • abs_pos: Index position of a row in the data.table, strictly monotonically increasing. Used to mark start and end of each walk
  • Id: customer identifier
  • walk_id: Identifier for a single walk in the table. All data marked with this id belong to the same walk
  • walk_from / walk_to: Start and end indices of the respective walk in the table, as marked by abs_pos. Constant per walk
  • tp.this.trans: Timepoint of later transaction that defines the walk
  • tp.cov.lower: Lower bound (incl) of the covariate period. Cov.Date in the original data given by user.
  • tp.cov.upper: Upper bound (incl) of the covariate period
  • tjk & d1: Transaction walks only. Both are constant per walk
  • < cov data cols > columns containing the actual covariate data

Walk Creation

All covariate data is defined for a closed interval [lower, upper] during which it is active, such that there are no gaps between covariates and subsequent covariates never overlap, i.e. [lower1, upper2], [lower2, upper2] where lower2=upper2+eps. It has to be stored as closed interval to ensure that it can be matched to the relevant transaction data. For every covariate (row), the upper and lower boundaries of this period is given, hence the covariate is active in the period [tp.cov.lower, tp.cov.upper].

A single walk is defined by 2 transactions and contains the covariate data that is active from the first to the second transaction, incl. the covariate on the day of the transaction itself. A walk is created by matching the interval from the first to second transaction of the walk to the interval of the covariate data using an interval join (foverlaps(type='any', mult='all')).

Transactions on the same timepoint are aggregated in clvdata so that there are no walk of length 0, but if the last transaction is on estimation end (t.x=T), the aux walk has start and end on the same timepoint. The aux walk then only consists of the covariate data active on this timepoint.

Differences to previous Implementation

Dyncov LL

  • Di, Bi, BkSum, BjSum: Testcase revealed 1-2 bugs related to how walks are summed for i={1,2} and short walks with only {1,2} elements. This produces slightly different results.
  • "Cheating for stability" section at the end of the LL was removed which overwrote non-finite values with the abs largest of finite values if there are <5% such values

Walk creation

  • ceiling_date(): In order to jump when on the covariate interval lower boundary, the parameter change_on_boundary is used instead of always adding +1 to the timepoint before applying ceiling_date() as was done before. This leads to slightly different d1 and tjk. Note that covariate intervals are defined to include the start from the time unit boundary, ie 01-01 is the covariate interval lower bound.
  • tjk for t.x=T: Slightly different because eps is no longer added to t.x
  • d1 and tjk: for customers with transaction on T (t.x=T)

Bug Fixes

  • Fixes lost aux walks (ref)
  • Fixes usage of newdata when predicting less than 2 period (ref)

Rcpp Implementation

LL

  • Calculate LL by customer to facilitate reasoning (vs math) and parallelization
  • Use customer and walk data structures (classes) instead of a purely vector based implementation for improved understanding and readability. It also abstracts away the memory representation.
  • The calculation exp(gamma*X) is done over the whole passed walk data matrix and not per customer
  • method pnbd_dyncov_getLLcallargs() can be used to get the required args to call LL

Classes

TransactionWalk, LifetimeWalk, EmptyLifetimeWalk

Represents covariate data of a single walk

  • store calculated exp(gX) data as (1xk) arma::vec, k=length of the walk
  • storing a subview instead of arma::vec leads to copy/move constructor issues
  • provide methods to access and sum over stored walk data
  • EmptyLifetimeWalk: If there is no lifetime walk
  • TransactionWalk inherits from LifetimeWalk with additional members for d1 and tjk

Customer

All data, inkl walks, related to a single customer

  • Store x, t.x, Tcal as doubles
  • Each type of walk is stored as a separate Walk object, a vector of walk for the real transaction walks
  • Separate constructors depending on whether the customer has real transaction walks or not (zero-repeater or not).
  • Customer constructor takes only raw data on walks an instantiates the required walks

Passing Data to the LL

  • Generally, data belonging to same customer is stored on the same position (row in matrix, index in vectors).
  • The walk covariate data is passed as stored in the respective columns in the walk tables: As (very) long matrices, which allows to efficiently compute exp(gamma*X) over the whole passed walk data matrix and not per customer.
  • Because the length of each walk varies, additional 'walk info' is given that specifies for each customer where their walks start and end in the walk data matrices.
  • Customer objects are instantiated from the passed data by iterating over the rows of the passed vectors/matrices

walkinfo_{aux, real}_{life, trans}

  • nx2 or nx4 matrix: For each customer, marks from/to of the respective walk in the covariate matrix in the first two columns
  • for transaction walks: nx4 matrix with d1 and tjk in the additional columns
  • if there is no real transaction walk for a customer (zero-repeater), the entries on this row in the walk info are non-finite (NA) (and there is no data in the covariate matrix)

Timing insights

image

  • Runtime is mostly driven by calls to hyp2f1() which cannot be reduced as they are in the GSL library
  • many zero-repeaters have long aux walks which increases the number of calls to hyp2f1 as part of F2.3 and therefore are a major driver of runtime

Recovery Results

From 50 simulated datasets, each 104 weeks

image

Testing

  • Define test cases for walk creation (intermediate steps and overall) and Bi/Di/BkSum in separate excel file in tests/testthat/dyncov_LL_reference.xlsx
  • Define tests for the Rcpp functions, namely of Bi(), BkSum/BjSum(), Di() and set up the required infrastructure
  • Enable runability tests for dyncov

Future improvements

  • Could parallelize evaluation of LL across customers with OpenMP but correct setup effort in package seems considerable. The data.table package could be a starting point to look at.
  • Call LL only for unique customers (incl with equal walk data) and multiply individual LL for the number of customers

Discarded Ideas

Discarded because it is hyp2f1() that drives runtime

  • retain all data structures and allocated memory (incl memory of exp(g*X)) in-between calls to LL. Repeatedly creating and discarding all walk and customer objects with each call of the LL has (surprisingly) a relatively small impact, likely because not much memory is allocated. Keeping these objects across LL calls was deemed not worth the additional implementation effort because hyp2f1() is the main driver
  • use more light-weight structure than vector<> in customer
  • do not copy x, tx, Tcal into customers but point to vector elements
  • sum(LL) as running sum to avoid allocating a potentially large result vector
  • do not return Rcpp data types but only arma::vec (and pass intermediate results to mat given by ref)

Open, scoped out issues

  • bkT == (C * T.cal) does not hold. The related test is disabled. However, bkT == C*(T.cal-1/7) does hold. This hints that either T.cal is calculated wrongly or some of the constituents of bkT which relies on date calculations such as dT or t_x is wrong by 1 day. Trying change_on_boundary=T/F for dT did not resolve the issue.
  • Fitting the model for POSIXct based time units such as hours is still disabled. It is untested if walk creation works correctly for these time units
  • Time unit 'day' uses the Date class which, depending on its definition, leads to d_ in dyncov to be only 1 or 0, respectively.
  • The definitional issues surrounding d_ are tracked in a separate issue Definition of 'd': is d 0 <= d < 1 or 0 < d <= 1? #222 .
  • Tests: Add a parameter recovery test

@pschil pschil marked this pull request as ready for review September 1, 2023 23:14
@pschil pschil merged commit e6ce4b4 into development Sep 6, 2023
9 checks passed
pschil added a commit that referenced this pull request Oct 23, 2023
* Add slash to name
* Fix Pareto/NBD print name (#201)
* Update author name
* Formula interface (#203)
* Github actions: Upgrade to v2 (#214)
* clvdata: Plot Transaction Timings (#213)
* Housekeeping: Makevars (#217)
* Run tests in parallel (#216)
* Draw other models in same plot (#215)
* PNBD Dyncov Rcpp (#205)
* Update docu: continuous discount rate (#223)
* Drop explicit c++11 specification
* Prepare release
* Fix CRAN CPU usage & fix docu
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant