
Robyn_refresh error when collection results #307

Closed
ohad-monday opened this issue Feb 14, 2022 · 48 comments
Assignees
Labels
bug Something isn't working

Comments

@ohad-monday

Project Robyn

Describe issue

Issue: when running robyn_refresh() with more than 22 incremental days, I get the error below.

Error message:

Collecting results...
Error in RNGseq(n, seed, ..., version = if (checkRNGversion("1.4") >= :
NMF::createStream - invalid value for 'n' [positive value expected]
Calls: robyn_refresh ... robyn_run -> %dorng% -> do.call -> doRNGseq -> RNGseq

Provide dummy data & model configuration

What else is needed?

Environment & Robyn version

R version : R version 4.0.3 (2020-10-10)
Robyn version: Robyn_3.4.8

@gufengzhou
Contributor

Hey, this is very strange. I can't think of any possible cause offhand. We'll need a reproducible example to replicate this error, including a dataset (true values masked) and your demo.R file for the model specification. If you don't want to share publicly, please send it via email to @laresbernardo and me (bernardolares@fb.com, gufeng@fb.com)

@ohad-monday
Author

How should I mask the data? By true values, do you mean features and target?

@gufengzhou
Contributor

Users usually don't share real data with us, but rather a somewhat randomised dataset.

@ohad-monday
Author

OK, sent it to your email

@laresbernardo laresbernardo self-assigned this Feb 22, 2022
@laresbernardo
Collaborator

Hi @ohad-monday
I've just run your example exactly as you sent it (with fewer iterations and trials) and it ran OK. I'm attaching the R file I used with your demo.csv (not included). Please be sure to update Robyn to the latest version (3.6.0, released today) and try again.
Note that iterations and trials are now set within robyn_run() (you'll get a warning).
Do let us know if it works for you after updating.
issue_307.R.zip
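For reference, a minimal sketch of the updated call shape, with iterations and trials passed to robyn_run() (argument names follow the demo.R workflow; the counts here are purely illustrative and other arguments are omitted):

# Robyn >= 3.6.0: iterations and trials are arguments of robyn_run()
OutputModels <- robyn_run(
  InputCollect = InputCollect, # object returned by robyn_inputs()
  iterations = 2000,           # iterations per trial
  trials = 5
)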

@ohad-monday
Author

@laresbernardo thank you, I'm checking right now.
One thing though: did you change something in the modeling part that could cause very poor model accuracy?
I'm running the pipeline from the beginning, and the fitted models on the same data are terrible. Did you change or add any parameter that I need to set in robyn_run()?

@gufengzhou
Contributor

Hey, yes, there are actually major improvements in optimisation; see here. Please also check out the new demo.R guide that introduces the new functionalities and workflows. You should also be seeing a new convergence message after the model runs. How many iterations are you running? For the simulated dataset it now converges at 1.5-2k iterations.

@ohad-monday
Author

  1. New version: I ran it with 2500 iterations... something is completely off.
    [image]

  2. Refresh bug: I ran it again with the new version and still get the same error:
    Finished in 4.04 mins

Running Pareto calculations for 3000 models on 3 fronts...
Error in RNGseq(n, seed, ..., version = if (checkRNGversion("1.4") >= :
NMF::createStream - invalid value for 'n' [positive value expected]
Calls: robyn_refresh ... robyn_pareto -> %dorng% -> do.call -> doRNGseq -> RNGseq
In addition: Warning message:
In check_calibconstr(calibration_constraint, OutputModels$iterations, :
calibration_constraint set for top 10% calibrated models. 300 models left for pareto-optimal selection. Minimum suggested:

@gufengzhou
Contributor

Can you share this plot with us: OutputCollect$OutputModels$convergence$moo_distrb_plot

@ohad-monday
Author

@gufengzhou I don't see this plot here:
[image]

@laresbernardo
Collaborator

laresbernardo commented Feb 23, 2022

@ohad-monday can you please check OutputModels$convergence$moo_distrb_plot instead? Is that the output of robyn_refresh() or robyn_run()?

@ohad-monday
Author

@laresbernardo sure! (it's from the robyn_run())
here it is:
[image]

@gufengzhou
Contributor

Thanks for sharing. As you can see, the model hasn't converged if you compare this with the example plot here. In particular, NRMSE hasn't moved from the right side, which reflects exactly the poor fit you showed before. I recommend running more iterations: you could try 5k with 1 trial first to see whether 5k converges. The reason the convergence speed changed is that we've added lambda as an extra hyperparameter to enable automatic selection of its optimum. We'll fine-tune this over time to improve convergence speed.
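For example, a single higher-iteration trial plus the convergence check could look roughly like this (a sketch; the counts are illustrative):

# One trial with more iterations to test convergence
OutputModels <- robyn_run(
  InputCollect = InputCollect,
  iterations = 5000,
  trials = 1
)
# Inspect the convergence diagnostics
OutputModels$convergence$moo_distrb_plot
OutputModels$convergence$moo_cloud_plot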

@gufengzhou
Contributor

FYI I've just committed a fix that should accelerate convergence. You should be seeing the "hills" in the plot moving left earlier compared to before. Let us know if it works.

@ohad-monday
Author

@gufengzhou Found the reason you weren't able to reproduce the error I got: the problem is a model in which a channel gets a 0 coefficient while calibration data is also provided for that channel. Might it be somehow related to the MAPE lift calculation for the nevergrad optimization?

@laresbernardo
Collaborator

Uhh, bummer. Would you be able to send me (laresbernardo @gmail.com) a CSV with anonymized data and the .R file you're using to run Robyn so I can replicate your issue exactly? It'd be really useful for debugging this error in case you're not able to fix it yourself and make a pull request. I'd be happy to check, given that Gufeng will be on leave for some months.

@laresbernardo laresbernardo added the bug Something isn't working label Feb 28, 2022
@ohad-monday
Author

@laresbernardo thanks! I shared an updated params.R with calibration data via email. The refresh is supposed to end with an error; if you then exclude the f6 feature from the calibration data, it will work.

@JiaMeihong

JiaMeihong commented Mar 10, 2022

Hi @laresbernardo @kyletgoldberg
I also use the new version of Robyn, and found the calibration results very poor. My actual and predicted results look very similar to this.
I tried increasing iterations to 5000-7000, but it still won't converge. What's even more strange to me is that the train R2 is negative.
Is there any update on this issue?

Correct me if I'm wrong, but my understanding is that the previous version, 3.5.0, chose the best lambda for the ridge regression using 10-fold CV on the window period. What has changed in the method used to tune lambda in version 3.6.0, and why did you decide to make the change?

Thanks very much for your help! I'm really learning a lot from your product development and discussion here!

@kyletgoldberg
Contributor

@JiaMeihong could you check whether any of the calibration data you are using corresponds to a channel that was assigned a 0 coefficient? It seems like that was causing the issue for @ohad-monday, so that would be a good place to start.
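A rough way to check (a sketch; it assumes OutputCollect$xDecompAgg carries the solID, rn and coef columns that are exported to pareto_aggregated.csv, and that calibration_input has a channel column):

# List model solutions where a calibrated channel received a zero coefficient
library(dplyr)
OutputCollect$xDecompAgg %>%
  filter(rn %in% calibration_input$channel, coef == 0) %>%
  distinct(solID, rn)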

@JiaMeihong

@kyletgoldberg Hi, thanks for your reply! I've checked the coefficients in "pareto_aggregated.csv". The channels of interest don't have a 0 coefficient. Although the results still seem far from convergence after these iterations, all channels in my model results have a positive coefficient.

@kyletgoldberg
Contributor

Thanks for taking a look. How many different calibration inputs are you using? This is a difficult one to replicate on our end since we don't have data to recreate it, but if you don't have too many calibrations, would it be possible to take them out one at a time and see whether one particular calibration input is causing the issue? I suspect some interaction between the calibration and the lambda hyperparameter optimization is causing it.

@JiaMeihong

JiaMeihong commented Mar 16, 2022

@kyletgoldberg
Thanks for your suggestion!
I've tried inputting just one media channel and one testing period into calibration, but the result still isn't working out.

My calibration data is by month, but my input data is by week. Could that possibly impact the result?

Could you briefly explain what has changed regarding calibration in Robyn version 3.6.0? When using the previous version with calibration, the results worked fine. Understanding the logic of the change may help me track down the bug further.

Thanks!

@kyletgoldberg
Contributor

@JiaMeihong That shouldn't be an issue. Nothing changed in 3.6 with respect to calibration itself, but in 3.6 we added lambda as a hyperparameter in the nevergrad optimization rather than choosing it via the CV method. This should lead to better results by allowing more flexibility in learning the hyperparameters, but it also seems to be causing some convergence issues when paired with calibration data at times.
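To illustrate the change (a sketch only: the placeholder channel names, ranges, and whether lambda must be listed explicitly or is appended automatically depend on your Robyn version; see demo.R and hyper_names() for the exact spec):

# 3.6-style hyperparameter search space with lambda included
hyperparameters <- list(
  tv_S_alphas = c(0.5, 3),
  tv_S_gammas = c(0.3, 1),
  tv_S_thetas = c(0.3, 0.8),
  lambda      = c(0, 1)  # ridge penalty, now searched by nevergrad instead of CV
)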

It seems like what is essentially happening here is that we are running into cases where the calibration data is so at odds with how the model wants to fit that it is never going to converge, which is a difficult problem to solve.

I saw you had edited a comment that initially said that removing one of the calibrations with the lowest lift got the model to work again - was that the case? It could be worth digging into how exactly the test was set up vs. how the data is collected for the channel (e.g., does the test encompass all of the spend you are measuring?) to ensure they are as aligned as possible. It would be great to learn anything you find on this part so we can keep working on figuring this out. Thanks for your patience!

@JiaMeihong

@kyletgoldberg Thanks very much for your detailed explanation! For the lambda part, I understand the version difference now.
Sorry for the confusion caused by my previous comment (already deleted); it turns out I had not included calibration properly, so please ignore it.

What I've found for now is that, yes, there's a convergence issue when I try to add calibration; even just one channel with one testing period won't converge in my case. I also played with another model without calibration but with a lot more media channels, and that case also hasn't converged after many iterations.

Running too many iterations gives me dying-kernel headaches, so simply increasing iterations in the hope of achieving convergence may not be a good idea. I would very much appreciate any other workarounds to reach convergence.

@kyletgoldberg
Contributor

@JiaMeihong do you mind sharing what you get when you run the following code after the non-convergent models, in both the calibration and non-calibration cases?

OutputModels$convergence$moo_distrb_plot
OutputModels$convergence$moo_cloud_plot

How many media channels are you including in that second case? We generally recommend having at least 10 observations per independent variable, so that may also be adding some difficulty. Thanks again for your patience.

@JiaMeihong

JiaMeihong commented Mar 24, 2022

@kyletgoldberg
Hi, please find the plot below. With 2500 iterations and 10 trials, no metric has converged.
[screenshot: convergence plots]

How many media channels are you including in that second case?

I have added 7 channels in total, and each has more than one year of weekly data, though I specify the modeling window to use only the most recent year.

@kyletgoldberg
Contributor

@JiaMeihong could you also run that without the calibration and share how it looks?

@laresbernardo
Collaborator

laresbernardo commented Mar 24, 2022

@JiaMeihong can you give us a bit more context on your calibration inputs?

  • Are those experiments measuring the same KPI you are modeling (dep_var)?
  • Is the spend on that experiment similar to the spend of that media channel date range?
  • How confident are you of the incremental results measured?

@JiaMeihong

JiaMeihong commented Mar 28, 2022

@laresbernardo
Thanks for your reply!!

Are those experiments measuring the same KPI you are modeling (dep_var)?

Yes

Is the spend on that experiment similar to the spend of that media channel date range?

There's fluctuation in spending, but I tried to transform the A/B test result by multiplying it by the ratio of average spending to actual spending.

How confident are you of the incremental results measured?

Most of the results are from A/B tests, so they should be trustworthy.

Even without calibration, another model in which I try to include more channels (7 in total) also fails to converge.
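Regarding the spend adjustment mentioned above, the transform is roughly the following (hypothetical object names, shown only to make the calculation explicit):

# Scale the measured lift by typical spend relative to spend during the test period
spend_ratio <- average_spend / actual_test_spend
liftAbs_adjusted <- measured_lift * spend_ratio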

@extrospective

If we run a model to convergence, does the Pareto Front determination exclude solutions which are pre-convergence?

@laresbernardo
Collaborator

Hi @extrospective
I guess the answer would be no, because we only compare the last 5% of models with the first ones, regardless of the number of iterations. You could back-engineer the values calculated for the last quantile, compare them with previous models, and check "when" you converged, but I don't think that makes much sense because these values are all relative.
Additionally, we only exclude solutions (models to be considered) that are NOT on the Pareto front(s), regardless of convergence. We don't have a pre- and post-convergence reference.

@extrospective

extrospective commented Apr 20, 2022

We are running Robyn 3.6.2

We had been using it with calibration successfully, but we changed our target variable and are now encountering exactly the error mentioned here.

Error in RNGseq(n, seed, ..., version = if (checkRNGversion("1.4") >=  : 
  NMF::createStream - invalid value for 'n' [positive value expected]

If there is any wisdom on what causes this, we would appreciate it, as we're on a tight deadline to turn this model around.

At first I thought we had too few iterations, so we increased to 1 trial x 6000 iterations.

Here is stdout. As you can see, we did not run the 20,000 total iterations suggested, partly because this was already 6 hours and we just wanted to see if the code ran before scaling up.

[1] "robyn_run started"
Warning in check_iteration(InputCollect$calibration_input, iterations, trials,  :
  You are calibrating MMM. We recommend to run at least 2000 iterations per trial and 10 trials to build initial model
Input data has 4411 days in total: 2010-02-01 to 2022-02-28
Initial model is built on rolling window of 365 day: 2021-03-01 to 2022-02-28
Using geometric adstocking with 40 hyperparameters (40 to iterate + 0 fixed) on 16 cores
>>> Starting 1 trials with 6000 iterations each with calibration using TwoPointsDE nevergrad algorithm...
  Running trial 1 of 1

  |                                                                      |   0%
  (...)
  |======================================================================| 100%
 
  Finished in 385.25 mins
Using robyn object location: output
Provided 'plot_folder' doesn't exist. Using default 'plot_folder = getwd()': /dbfs/robyn_output/poc/order_count_new/US/2022-02-28
>>> Running Pareto calculations for 6000 models on 3 fronts...

@kyletgoldberg
Contributor

@extrospective could you share your session info please? I've seen this error once before when a user was running on GCP. Are you running locally or on a cloud-based platform?

@extrospective

We are running in Databricks.

@extrospective

Databricks runtime 10.2

Here is the R version info:

$platform
[1] "x86_64-pc-linux-gnu"

$arch
[1] "x86_64"

$os
[1] "linux-gnu"

$system
[1] "x86_64, linux-gnu"

$status
[1] ""

$major
[1] "4"

$minor
[1] "1.2"

$year
[1] "2021"

$month
[1] "11"

$day
[1] "01"

$`svn rev`
[1] "81115"

$language
[1] "R"

$version.string
[1] "R version 4.1.2 (2021-11-01)"

$nickname
[1] "Bird Hippie"

We pulled 3.6.2 from the git repo roughly 2 weeks ago; checking now for details.

@extrospective

R 4.1.2 is the maximum available in Databricks runtimes at this time.

@kyletgoldberg
Contributor

@extrospective would it be possible to try running locally to see if you get the same error? I don't think you need to run as many trials; we should be able to see whether it works with fewer. I suspect this may be an issue Robyn has with cloud platforms, so that would be a good one to rule out if we can.

@extrospective

We will try some further tests. I'm not sure why "cloud platforms" should be a source of error, but since cloud platforms may have different library versions, I think a library comparison between what works for you and what does not work for us would be helpful.

Our next runs include:

  • with calibration off
  • locally

@kyletgoldberg
Contributor

Good point. sessionInfo() should provide an output of all the libraries and their current versions; if you could share that, I can compare it with what's working on our machines while you're working through those other runs.

@extrospective

I am not sure the April 12 commit is in our copy. The prior commits should be in our copy of Robyn.

@extrospective

This is the sessionInfo() for our run.
We especially wanted to check rngtools and doRNG versions, but these seem okay.
So we are going to investigate data or logic on our end.

R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] Robyn_3.6.2       reticulate_1.24   testit_0.13       SparkR_3.2.0     
 [5] scales_1.1.1      patchwork_1.1.1   ggplot2_3.3.5     stringr_1.4.0    
 [9] data.table_1.14.2 R.utils_2.11.0    R.oo_1.24.0       R.methodsS3_1.8.1
[13] readr_2.1.0       dplyr_1.0.7       plyr_1.8.7        rlist_0.4.6.2    

loaded via a namespace (and not attached):
 [1] httr_1.4.2         tidyr_1.2.0        jsonlite_1.8.0     splines_4.1.2     
 [5] foreach_1.5.2      here_1.0.1         RcppParallel_5.1.5 assertthat_0.2.1  
 [9] lares_5.1.2        doRNG_1.8.2        yaml_2.3.5         pillar_1.6.4      
[13] lattice_0.20-45    glue_1.5.0         pROC_1.18.0        digest_0.6.28     
[17] rvest_1.0.2        colorspace_2.0-3   htmltools_0.5.2    Matrix_1.3-4      
[21] pkgconfig_2.0.3    purrr_0.3.4        openxlsx_4.2.5     TeachingDemos_2.10
[25] tzdb_0.2.0         tibble_3.1.6       generics_0.1.1     ellipsis_0.3.2    
[29] withr_2.5.0        lazyeval_0.2.2     survival_3.2-13    magrittr_2.0.1    
[33] crayon_1.5.1       Rserve_1.8-10      fansi_0.5.0        doParallel_1.0.17 
[37] xml2_1.3.3         hwriter_1.3.2      tools_4.1.2        hms_1.1.1         
[41] minpack.lm_1.2-1   lifecycle_1.0.1    rpart.plot_3.1.0   munsell_0.5.0     
[45] glmnet_4.1-3       prophet_1.0        rngtools_1.5.2     zip_2.2.0         
[49] compiler_4.1.2     rlang_0.4.12       grid_4.1.2         RCurl_1.98-1.6    
[53] nloptr_2.0.0       ggridges_0.5.3     iterators_1.0.14   rPref_1.3         
[57] rappdirs_0.3.3     igraph_1.3.0       bitops_1.0-7       gtable_0.3.0      
[61] codetools_0.2-18   DBI_1.1.1          R6_2.5.1           hwriterPlus_1.0-3 
[65] lubridate_1.8.0    fastmap_1.1.0      utf8_1.2.2         rprojroot_2.0.3   
[69] h2o_3.36.0.4       shape_1.4.6        stringi_1.7.6      parallel_4.1.2    
[73] Rcpp_1.0.8.3       vctrs_0.3.8        rpart_4.1-15       png_0.1-7         
[77] tidyselect_1.1.1  

@extrospective

extrospective commented Apr 20, 2022

One change made on our end since a prior run was the addition of SparkR, so we will also test whether that library is the source of an issue. --> We were wrong about this; SparkR has been in the session all along (see next comment).

@extrospective

Databricks has SparkR in an empty notebook by default.
I note that there are some name conflicts between SparkR and the tidyverse.

We showed that if we turn calibration off, we do not encounter the error mentioned above.
We were unable to easily remove SparkR from the Databricks test.
We will further examine whether anything about our data or its calibration contributed to the error.

[From this research we have learned about the doRNG and rngtools libraries, which can be used to trace through these errors (for those who encounter this error in the future and want to investigate more quickly).]

@extrospective

I suspect there is an error in liftAbs in the calibration data -> found some NAs.

I would have assumed robyn_run() would notice this earlier than at the final Pareto generation step, since it's an input to the objective function. Can checks for NA be added?

But first, I'll verify this is the source of the problem.
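A quick way to check (a sketch, assuming calibration_input follows the documented layout with a liftAbs column):

# Rows of the calibration input with a missing or non-finite liftAbs
calibration_input[!is.finite(calibration_input$liftAbs), ]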

@extrospective

extrospective commented Apr 20, 2022

We have confirmed that, for us, this error was a data issue.
liftAbs was NA for every row in the run that triggered this error, and once that was corrected the problem did not occur.

Based on this, I might recommend:

  • Robyn should check that liftAbs is a valid float for every row and, if not, throw an error (a sketch of such a check follows below)

This assertion would then avoid an unusual and confusing error.
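Such a guard could be as simple as the following sketch (illustrative only, not the actual Robyn implementation):

# Suggested validation for calibration_input$liftAbs
check_calibration_lift <- function(calibration_input) {
  lift <- calibration_input$liftAbs
  if (!is.numeric(lift) || any(!is.finite(lift))) {
    stop("calibration_input$liftAbs must be a finite numeric value for every row.")
  }
  invisible(TRUE)
}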

And then I think this ticket could be closed, with the hypothesis that this same issue caused the other errors reported with this symptom.

@kyletgoldberg
Contributor

Thanks for confirming @extrospective - we will add a check for that and then close the issue out with that fix.

@laresbernardo
Collaborator

Thanks for the feedback and for checking the source of this issue @extrospective. Your recommendation has been implemented! Feel free to close this ticket if you consider it fixed.

@extrospective

I do not have the option to close the ticket. I recommend closing.


6 participants