
Robyn_refresh error when collection results #307

Closed
ohad-monday opened this issue Feb 14, 2022 · 48 comments
Assignees
Labels
bug Something isn't working

Comments

@ohad-monday

Project Robyn

Describe issue

Issue: when running robyn_refresh() with more than 22 incremental days, I get the error below.

Error message:

Collecting results...
Error in RNGseq(n, seed, ..., version = if (checkRNGversion("1.4") >= :
NMF::createStream - invalid value for 'n' [positive value expected]
Calls: robyn_refresh ... robyn_run -> %dorng% -> do.call -> doRNGseq -> RNGseq

Provide dummy data & model configuration

What else is needed?

Environment & Robyn version

R version : R version 4.0.3 (2020-10-10)
Robyn version: Robyn_3.4.8

@gufengzhou
Contributor

Hey, this is very strange. I can't think of any possible cause offhand. We'll need a reproducible example to replicate this error, including a dataset (true values masked) and your demo.R file for the model specification. If you don't want to share publicly, please send it via email to @laresbernardo and me (bernardolares@fb.com, gufeng@fb.com)

@ohad-monday
Author

How should I mask the data? By true values, do you mean features and target?

@gufengzhou
Contributor

Users usually don't share real data with us, but rather a somewhat randomised dataset.

@ohad-monday
Author

OK, sent it to your email

@laresbernardo laresbernardo self-assigned this Feb 22, 2022
@laresbernardo
Collaborator

Hi @ohad-monday
I've just run your example exactly as you sent it (with fewer iterations and trials) and it ran OK. I'm attaching the R file I used with your demo.csv (not included). Please be sure to update Robyn to the latest version (3.6.0, released today) and try again.
Note that iterations and trials are now set within robyn_run() (you'll get a warning).
Do let us know if it works for you after updating.
issue_307.R.zip
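For reference, a minimal sketch of the updated call shape, with iterations and trials passed to robyn_run() (argument names follow the demo.R workflow; the counts here are purely illustrative and other arguments are omitted):

# Robyn >= 3.6.0: iterations and trials are arguments of robyn_run()
OutputModels <- robyn_run(
  InputCollect = InputCollect, # object returned by robyn_inputs()
  iterations = 2000,           # iterations per trial
  trials = 5
)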

@ohad-monday
Author

@laresbernardo thank you, I'm checking right now.
One thing though: did you change something in the modeling part that could cause very poor model accuracy?
I'm running the pipeline from the beginning, and the fitted models on the same data are terrible. Did you change or add any parameter that I need to set in robyn_run()?

@gufengzhou
Contributor

Hey, yes, there are actually major improvements in optimisation; see here. Please also check out the new demo.R guide that introduces the new functionalities and workflows. You should also be seeing a new convergence message after the model runs. How many iterations are you running? For the simulated dataset it now converges at 1.5-2k iterations.

@ohad-monday
Author

  1. New version: I ran it with 2500 iterations... something is completely off.
    [image]

  2. Refresh bug: I ran it again with the new version and still get the same error:
    Finished in 4.04 mins

Running Pareto calculations for 3000 models on 3 fronts...
Error in RNGseq(n, seed, ..., version = if (checkRNGversion("1.4") >= :
NMF::createStream - invalid value for 'n' [positive value expected]
Calls: robyn_refresh ... robyn_pareto -> %dorng% -> do.call -> doRNGseq -> RNGseq
In addition: Warning message:
In check_calibconstr(calibration_constraint, OutputModels$iterations, :
calibration_constraint set for top 10% calibrated models. 300 models left for pareto-optimal selection. Minimum suggested:

@gufengzhou
Contributor

Can you share this plot with us: OutputCollect$OutputModels$convergence$moo_distrb_plot

@ohad-monday
Author

@gufengzhou I don't see this plot here:
[image]

@laresbernardo
Collaborator

laresbernardo commented Feb 23, 2022

@ohad-monday can you please check OutputModels$convergence$moo_distrb_plot instead? Is that the output of robyn_refresh() or robyn_run()?

@ohad-monday
Author

@laresbernardo sure! (it's from the robyn_run())
here it is:
[image]

@gufengzhou
Contributor

Thanks for sharing. As you can see, the model hasn't converged if you compare this with the example plot here. In particular, NRMSE hasn't moved from the right side, which reflects exactly the poor fit you showed before. I recommend running more iterations: you could try 5k with 1 trial first to see whether 5k converges. The reason the convergence speed changed is that we've added lambda as an extra hyperparameter to enable automatic selection of its optimum. We'll fine-tune this over time to improve convergence speed.
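For example, a single higher-iteration trial plus the convergence check could look roughly like this (a sketch; the counts are illustrative):

# One trial with more iterations to test convergence
OutputModels <- robyn_run(
  InputCollect = InputCollect,
  iterations = 5000,
  trials = 1
)
# Inspect the convergence diagnostics
OutputModels$convergence$moo_distrb_plot
OutputModels$convergence$moo_cloud_plot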

@gufengzhou
Contributor

FYI I've just committed a fix that should accelerate convergence. You should be seeing the "hills" in the plot moving left earlier compared to before. Let us know if it works.

@ohad-monday
Author

@gufengzhou Found the reason you weren't able to reproduce the error I got: the problem is a model in which a channel gets a 0 coefficient while calibration data is also provided for that channel. Might it be somehow related to the MAPE lift calculation for the nevergrad optimization?

@laresbernardo
Collaborator

Uhh, bummer. Would you be able to send me (laresbernardo @gmail.com) a CSV with anonymized data and the .R file you're using to run Robyn so I can replicate your issue exactly? It'd be really useful for debugging this error in case you're not able to fix it yourself and make a pull request. I'd be happy to check, given that Gufeng will be on leave for some months.

@laresbernardo laresbernardo added the bug Something isn't working label Feb 28, 2022
@ohad-monday
Author

@laresbernardo thanks! I shared an updated params.R with calibration data via email. The refresh is supposed to end with an error; if you then exclude the f6 feature from the calibration data, it will work.

@JiaMeihong

JiaMeihong commented Mar 10, 2022

Hi @laresbernardo @kyletgoldberg
I also use the new version of Robyn, and found the calibration results very poor. My actual and predicted results look very similar to this.
I tried increasing iterations to 5000-7000, but it still won't converge. What's even more strange to me is that the train R2 is negative.
Is there any update on this issue?

Correct me if I'm wrong, but my understanding is that the previous version, 3.5.0, chose the best lambda for the ridge regression using 10-fold CV on the window period. What has changed in the method used to tune lambda in version 3.6.0, and why did you decide to make the change?

Thanks very much for your help! I'm really learning a lot from your product development and discussion here!

@kyletgoldberg
Contributor

@JiaMeihong could you check whether any of the calibration data you are using corresponds to a channel that was assigned a 0 coefficient? It seems like that was causing the issue for @ohad-monday, so that would be a good place to start.
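A rough way to check (a sketch; it assumes OutputCollect$xDecompAgg carries the solID, rn and coef columns that are exported to pareto_aggregated.csv, and that calibration_input has a channel column):

# List model solutions where a calibrated channel received a zero coefficient
library(dplyr)
OutputCollect$xDecompAgg %>%
  filter(rn %in% calibration_input$channel, coef == 0) %>%
  distinct(solID, rn)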

@JiaMeihong

@kyletgoldberg Hi, thanks for your reply! I've checked the coefficients in "pareto_aggregated.csv". The channels of interest don't have a 0 coefficient. Although the results still seem far from convergence after these iterations, all channels in my model results have a positive coefficient.

@kyletgoldberg
Contributor

Thanks for taking a look. How many different calibration inputs are you using? This is a difficult one to replicate on our end since we don't have data to recreate it, but if you don't have too many calibrations, would it be possible to take them out one at a time and see whether one particular calibration input is causing the issue? I suspect some interaction between the calibration and the lambda hyperparameter optimization is causing it.

@JiaMeihong

JiaMeihong commented Mar 16, 2022

@kyletgoldberg
Thanks for your suggestion!
I've tried inputting just one media channel and one testing period into calibration, but the result still isn't working out.

My calibration data is by month, but my input data is by week. Could that possibly impact the result?

Could you briefly explain what has changed regarding calibration in Robyn version 3.6.0? When using the previous version with calibration, the results worked fine. Understanding the logic of the change may help me track down the bug further.

Thanks!

@kyletgoldberg
Contributor

@JiaMeihong That shouldn't be an issue. Nothing changed in 3.6 with respect to calibration itself, but in 3.6 we added lambda as a hyperparameter in the nevergrad optimization rather than choosing it via the CV method. This should lead to better results by allowing more flexibility in learning the hyperparameters, but it also seems to be causing some convergence issues when paired with calibration data at times.
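To illustrate the change (a sketch only: the placeholder channel names, ranges, and whether lambda must be listed explicitly or is appended automatically depend on your Robyn version; see demo.R and hyper_names() for the exact spec):

# 3.6-style hyperparameter search space with lambda included
hyperparameters <- list(
  tv_S_alphas = c(0.5, 3),
  tv_S_gammas = c(0.3, 1),
  tv_S_thetas = c(0.3, 0.8),
  lambda      = c(0, 1)  # ridge penalty, now searched by nevergrad instead of CV
)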

It seems like what is essentially happening here is that we are running into cases where the calibration data is so at odds with how the model wants to fit that it is never going to converge, which is a difficult problem to solve.

I saw you had edited a comment that initially said that removing one of the calibrations with the lowest lift got the model to work again - was that the case? It could be worth digging into how exactly the test was set up vs. how the data is collected for the channel (e.g., does the test encompass all of the spend you are measuring?) to ensure they are as aligned as possible. It would be great to learn anything you find on this part so we can keep working on figuring this out. Thanks for your patience!

@JiaMeihong

@kyletgoldberg Thanks very much for your detailed explanation! For the lambda part, I understand the version difference now.
Sorry for the confusion caused by my previous comment (already deleted); it turns out I had not included calibration properly, so please ignore it.

What I've found for now is that, yes, there's a convergence issue when I try to add calibration; even just one channel with one testing period won't converge in my case. I also played with another model without calibration but with a lot more media channels, and that case also hasn't converged after many iterations.

Running too many iterations gives me dying-kernel headaches, so simply increasing iterations in the hope of achieving convergence may not be a good idea. I would very much appreciate any other workarounds to reach convergence.

@kyletgoldberg
Contributor

@JiaMeihong do you mind sharing what you get when you run the following code after the non-convergent models, in both the calibration and non-calibration cases?

OutputModels$convergence$moo_distrb_plot
OutputModels$convergence$moo_cloud_plot

How many media channels are you including in that second case? We generally recommend having at least 10 observations per independent variable, so that may also be adding some difficulty. Thanks again for your patience.

@JiaMeihong

JiaMeihong commented Mar 24, 2022

@kyletgoldberg
Hi, please find the plot below. With 2500 iterations and 10 trials, no metric has converged.
[screenshot: convergence plots]

How many media channels are you including in that second case?

I have added 7 channels in total, and each has more than one year of weekly data, though I specify the modeling window to use only the most recent year.

@kyletgoldberg
Contributor

@JiaMeihong could you also run that without the calibration and share how it looks?

@laresbernardo
Collaborator

laresbernardo commented Mar 24, 2022

@JiaMeihong can you give us a bit more context on your calibration inputs?

  • Are those experiments measuring the same KPI you are modeling (dep_var)?
  • Is the spend on that experiment similar to the spend of that media channel date range?
  • How confident are you of the incremental results measured?

@JiaMeihong

JiaMeihong commented Mar 28, 2022

@laresbernardo
Thanks for your reply!!

Are those experiments measuring the same KPI you are modeling (dep_var)?

Yes

Is the spend on that experiment similar to the spend of that media channel date range?

There's fluctuation in spending, but I tried to transform the A/B test result by multiplying it by the ratio of average spending to actual spending.

How confident are you of the incremental results measured?

Most of the results are from A/B tests, so they should be trustworthy.

Even without calibration, another model in which I try to include more channels (7 in total) also fails to converge.
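Regarding the spend adjustment mentioned above, the transform is roughly the following (hypothetical object names, shown only to make the calculation explicit):

# Scale the measured lift by typical spend relative to spend during the test period
spend_ratio <- average_spend / actual_test_spend
liftAbs_adjusted <- measured_lift * spend_ratio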

@extrospective

If we run a model to convergence, does the Pareto Front determination exclude solutions which are pre-convergence?

@laresbernardo
Collaborator

Hi @extrospective
I guess the answer would be no, because we only compare the last 5% of models with the first ones, regardless of the number of iterations. You could back-engineer the values calculated for the last quantile, compare them with previous models, and check "when" you converged, but I don't think that makes much sense because these values are all relative.
Additionally, we only exclude solutions (models to be considered) that are NOT on the Pareto front(s), regardless of convergence. We don't have a pre- and post-convergence reference.

@extrospective

extrospective commented Apr 20, 2022

We are running Robyn 3.6.2

We had been using it with calibration successfully, but we changed our target variable and are now encountering exactly the error mentioned here.

Error in RNGseq(n, seed, ..., version = if (checkRNGversion("1.4") >=  : 
  NMF::createStream - invalid value for 'n' [positive value expected]

If there is any wisdom on what causes this, we would appreciate it, as we're on a tight deadline to turn this model around.

At first I thought we had too few iterations, so we increased to 1 trial x 6000 iterations.

Here is stdout. As you can see, we did not run the 20,000 total iterations suggested, partly because this was already 6 hours and we just wanted to see if the code ran before scaling up.

[1] "robyn_run started"
Warning in check_iteration(InputCollect$calibration_input, iterations, trials,  :
  You are calibrating MMM. We recommend to run at least 2000 iterations per trial and 10 trials to build initial model
Input data has 4411 days in total: 2010-02-01 to 2022-02-28
Initial model is built on rolling window of 365 day: 2021-03-01 to 2022-02-28
Using geometric adstocking with 40 hyperparameters (40 to iterate + 0 fixed) on 16 cores
>>> Starting 1 trials with 6000 iterations each with calibration using TwoPointsDE nevergrad algorithm...
  Running trial 1 of 1

  |                                                                      |   0%
  (...)
  |======================================================================| 100%
 
  Finished in 385.25 mins
Using robyn object location: output
Provided 'plot_folder' doesn't exist. Using default 'plot_folder = getwd()': /dbfs/robyn_output/poc/order_count_new/US/2022-02-28
>>> Running Pareto calculations for 6000 models on 3 fronts...

@kyletgoldberg
Contributor

@extrospective could you share your session info please? I've seen this error once before when a user was running on GCP. Are you running locally or on a cloud-based platform?

@extrospective

We are running in Databricks.

@extrospective

Databricks runtime 10.2

Here is the R version info:

$platform
[1] "x86_64-pc-linux-gnu"

$arch
[1] "x86_64"

$os
[1] "linux-gnu"

$system
[1] "x86_64, linux-gnu"

$status
[1] ""

$major
[1] "4"

$minor
[1] "1.2"

$year
[1] "2021"

$month
[1] "11"

$day
[1] "01"

$`svn rev`
[1] "81115"

$language
[1] "R"

$version.string
[1] "R version 4.1.2 (2021-11-01)"

$nickname
[1] "Bird Hippie"

We pulled 3.6.2 from the git repo roughly 2 weeks ago; checking now for details.

@extrospective

R 4.1.2 is the maximum available in Databricks runtimes at this time.

@kyletgoldberg
Contributor

@extrospective would it be possible to try running locally to see if you get the same error? I don't think you need to run as many trials; we should be able to see whether it works with fewer. I suspect this may be an issue Robyn has with cloud platforms, so that would be a good one to rule out if we can.

@extrospective

We will try some further tests. I'm not sure why "cloud platforms" should be a source of error, but since cloud platforms may have different library versions, I think a library comparison between what works for you and what does not work for us would be helpful.

Our next runs include:

  • with calibration off
  • locally

@kyletgoldberg
Contributor

Good point. sessionInfo() should provide an output of all the libraries and their current versions; if you could share that, I can compare it with what's working on our machines while you're working through those other runs.

@extrospective

I am not sure the April 12 commit is in our copy. The prior commits should be in our copy of Robyn.

@extrospective

This is the sessionInfo() for our run.
We especially wanted to check rngtools and doRNG versions, but these seem okay.
So we are going to investigate data or logic on our end.

R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] Robyn_3.6.2       reticulate_1.24   testit_0.13       SparkR_3.2.0     
 [5] scales_1.1.1      patchwork_1.1.1   ggplot2_3.3.5     stringr_1.4.0    
 [9] data.table_1.14.2 R.utils_2.11.0    R.oo_1.24.0       R.methodsS3_1.8.1
[13] readr_2.1.0       dplyr_1.0.7       plyr_1.8.7        rlist_0.4.6.2    

loaded via a namespace (and not attached):
 [1] httr_1.4.2         tidyr_1.2.0        jsonlite_1.8.0     splines_4.1.2     
 [5] foreach_1.5.2      here_1.0.1         RcppParallel_5.1.5 assertthat_0.2.1  
 [9] lares_5.1.2        doRNG_1.8.2        yaml_2.3.5         pillar_1.6.4      
[13] lattice_0.20-45    glue_1.5.0         pROC_1.18.0        digest_0.6.28     
[17] rvest_1.0.2        colorspace_2.0-3   htmltools_0.5.2    Matrix_1.3-4      
[21] pkgconfig_2.0.3    purrr_0.3.4        openxlsx_4.2.5     TeachingDemos_2.10
[25] tzdb_0.2.0         tibble_3.1.6       generics_0.1.1     ellipsis_0.3.2    
[29] withr_2.5.0        lazyeval_0.2.2     survival_3.2-13    magrittr_2.0.1    
[33] crayon_1.5.1       Rserve_1.8-10      fansi_0.5.0        doParallel_1.0.17 
[37] xml2_1.3.3         hwriter_1.3.2      tools_4.1.2        hms_1.1.1         
[41] minpack.lm_1.2-1   lifecycle_1.0.1    rpart.plot_3.1.0   munsell_0.5.0     
[45] glmnet_4.1-3       prophet_1.0        rngtools_1.5.2     zip_2.2.0         
[49] compiler_4.1.2     rlang_0.4.12       grid_4.1.2         RCurl_1.98-1.6    
[53] nloptr_2.0.0       ggridges_0.5.3     iterators_1.0.14   rPref_1.3         
[57] rappdirs_0.3.3     igraph_1.3.0       bitops_1.0-7       gtable_0.3.0      
[61] codetools_0.2-18   DBI_1.1.1          R6_2.5.1           hwriterPlus_1.0-3 
[65] lubridate_1.8.0    fastmap_1.1.0      utf8_1.2.2         rprojroot_2.0.3   
[69] h2o_3.36.0.4       shape_1.4.6        stringi_1.7.6      parallel_4.1.2    
[73] Rcpp_1.0.8.3       vctrs_0.3.8        rpart_4.1-15       png_0.1-7         
[77] tidyselect_1.1.1  

@extrospective

extrospective commented Apr 20, 2022

One change made on our end since a prior run was the addition of SparkR, so we will also test whether that library is the source of an issue. --> We were wrong about this; SparkR has been in the session all along (see next comment).

@extrospective

Databricks has SparkR in an empty notebook by default.
I note that there are some name conflicts between SparkR and the tidyverse.

We showed that if we turn calibration off, we do not encounter the error mentioned above.
We were unable to easily remove SparkR from the Databricks test.
We will further examine whether anything about our data or its calibration contributed to the error.

[From this research we have learned about the doRNG and rngtools libraries, which can be used to trace through these errors (for those who encounter this error in the future and want to investigate more quickly).]

@extrospective

I suspect there is an error in liftAbs in the calibration data -> found some NAs.

I would have assumed robyn_run() would notice this earlier than at the final Pareto generation step, since it's an input to the objective function. Can checks for NA be added?

But first, I'll verify this is the source of the problem.
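A quick way to check (a sketch, assuming calibration_input follows the documented layout with a liftAbs column):

# Rows of the calibration input with a missing or non-finite liftAbs
calibration_input[!is.finite(calibration_input$liftAbs), ]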

@extrospective

extrospective commented Apr 20, 2022

We have confirmed that, for us, this error was a data issue.
liftAbs was NA for every row in the run that triggered this error, and once that was corrected the problem did not occur.

Based on this, I might recommend:

  • Robyn should check that liftAbs is a valid float for every row and, if not, throw an error (a sketch of such a check follows below)

This assertion would then avoid an unusual and confusing error.
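Such a guard could be as simple as the following sketch (illustrative only, not the actual Robyn implementation):

# Suggested validation for calibration_input$liftAbs
check_calibration_lift <- function(calibration_input) {
  lift <- calibration_input$liftAbs
  if (!is.numeric(lift) || any(!is.finite(lift))) {
    stop("calibration_input$liftAbs must be a finite numeric value for every row.")
  }
  invisible(TRUE)
}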

And then I think this ticket could be closed, with the hypothesis that this same issue caused the other errors reported with this symptom.

@kyletgoldberg
Contributor

Thanks for confirming @extrospective - we will add a check for that and then close the issue out with that fix.

@laresbernardo
Collaborator

Thanks for the feedback and for checking the source of this issue @extrospective. Your recommendation has been implemented! Feel free to close this ticket if you consider it fixed.

@extrospective

I do not have the option to close the ticket. I recommend closing.


6 participants