
Clarification Questions #214

Closed
RaviP1987 opened this issue Nov 10, 2021 · 12 comments


@RaviP1987

Hi FB Team

Great initiative and really impressive work!!! I have a few questions to better understand the package.

  1. Where should I add the following variables: Distribution, Store Display and Pricing? I understand we can't add them to the context variables (as these are related to trend/seasonality) and can't add them to 'organic_vars' either, because the code will apply adstock/saturation transformations to 'organic_vars'. Please advise. (Also, I am not sure what the purpose of 'factor_vars' is; kindly help me understand.)

  2. I only have monthly data and my category is non-seasonal. How can I remove the context variable from the modelling?

  3. I see that the objective for identifying the most optimal models doesn't include 'R-squared' or 'MAPE'. May I understand why you haven't included these criteria when identifying the optimal models?

  4. How can I identify the 'Durbin-Watson', 'VIF' and 'p-value' for each independent variable, to better understand the model parameters?

  5. Lastly, I have only 45 time-series data points (national data), but I have these broken down at a product-SKU and regional level. Any advice on how I can use the regional and product-SKU data to compensate for the relatively low number (45) of time-series data points?

Sorry for the long list of questions; I would really appreciate it if you could advise on the above.

Thanks very much
Ravi

@gufengzhou
Contributor

Thanks!

  1. Distribution, Store Display and Pricing do sound like context_vars to me. When you're not expecting adstock & saturation, context_vars is actually the only place to go. factor_vars specifies which variables are factorial/categorical. For example, if you have a variable "offline_events" that contains only 0 and 1, you should use it in context_vars or organic_vars AND specify it in factor_vars (see the sketch after this list).
  2. Although Robyn can handle monthly data, it's highly recommended to use a lower time grain. We recommend a column:row ratio of 1:10 for the input data; data that is too sparse will result in an unstable model. Even if you consider your data non-seasonal, I would still recommend using prophet_vars to account for subtle underlying patterns.
  3. We use NRMSE as the loss function to control for model error, a metric that is highly correlated with R-squared and MAPE. Including them would be redundant and would burden the multi-objective optimisation unnecessarily.
  4. We don't provide stats for single independent variables because of the regularisation technique we're using. See the explanation here.
  5. Unfortunately, Robyn can't treat panel data properly at the moment. This is something we're considering for the future.
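
To illustrate point 1, here's a minimal sketch of how such variables might be declared in robyn_inputs(). The argument names follow the public demo script (verify against your installed version), and all column names (distribution, store_display, pricing, offline_events, tv_S, facebook_S) are hypothetical placeholders for your own data:

```r
library(Robyn)

# Non-media drivers go into context_vars (no adstock/saturation applied);
# the binary 0/1 flag is additionally declared in factor_vars.
InputCollect <- robyn_inputs(
  dt_input = dt_input,                # your input data
  dt_holidays = dt_prophet_holidays,  # holiday table shipped with Robyn
  date_var = "DATE",
  dep_var = "revenue",
  dep_var_type = "revenue",
  prophet_vars = c("trend", "season", "holiday"),
  paid_media_spends = c("tv_S", "facebook_S"),
  paid_media_vars = c("tv_S", "facebook_S"),
  context_vars = c("distribution", "store_display", "pricing", "offline_events"),
  factor_vars = c("offline_events"),  # treated as categorical, not numeric
  window_start = "2019-01-01",
  window_end = "2021-12-31",
  adstock = "geometric"
)
```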

@RaviP1987
Author

RaviP1987 commented Nov 11, 2021

Thanks so much, Gufeng, for the quick reply! Very clear; just a few follow-up questions.

  1. How can I identify what transformations are applied to 'context_vars'? I am asking because some variables like Price or Promotion generally need a square-root(x) transformation.
  2. On the response curves, two clarifications please: (a) I understand the y-axis is the dependent variable ("revenue" in the current dataset). Please confirm. (b) On the x-axis, can I change the code to plot GRPs or impressions instead of spend? (Kindly point me towards the right code file, if possible.)
  3. On the share of spend/effect and ROI chart, I understand the ROI is for the full modelling time period. I wanted to understand how I can see ROI for each year (or by time period), because ROI generally changes with creative executions (FB banner ad vs. TV ad).
  4. Makes sense that NRMSE optimises both R-squared and MAPE. However, sometimes teams want to see R-squared/MAPE etc. Can you please point me towards the code file where I could add 1-2 lines (locally) to print/display MAPE/R-squared? (Same for DW, VIF and p-values.)
  5. Currently, the outputs don't show NRMSE for the test and train datasets separately. Could you please point me towards the code file where I can see/add some code to print NRMSE/R-squared/MAPE for the test and train datasets separately?

Lastly, I must say that Robyn has massively helped me save time and effort on an MMM model. I was actually writing MMM model code when I stumbled upon the Robyn project!! Thank you so much for making this open source.

Regards
Ravi

@gufengzhou
Contributor

Glad to know it helps.

  1. context_vars are not transformed at all. If you need a transformation, apply it before running Robyn.
  2. Yes, y is dep_var, or revenue in your case. When using non-spend vars as paid_media_vars, there's a two-stage transformation when building the response curve: response -> GRP using the Hill function (our saturation function), then GRP -> spend using the Michaelis-Menten function (our so-called spend-exposure model). For now, the x-axis always defaults to spend. We'll consider offering the option to plot exposure metrics in the future. The code snippet for plotting is here.
  3. In pareto_alldecomp_matrix.csv, or just OutputCollect$xDecompVecCollect, you can find the decomposed effect as a time series. Using your original spend series, you can compute any custom ROI (see the sketch after this list).
  4. We do provide R-squared along with NRMSE for each model in all output tables in OutputCollect and in all CSVs. We don't provide MAPE because it's not considered a diagnostic KPI and is redundant to, and less accurate than, NRMSE. We don't provide Durbin-Watson because it requires the OLS assumptions, while Robyn uses regularised regression; moreover, autocorrelation is treated when using trend/season from Prophet, as well as when the lagged transformation from the weibull_pdf adstock is selected. We don't provide VIF either, because it doesn't make sense to look at a VIF score when performing a regularised regression. The same goes for p-values, which follow the OLS assumptions (please read this thread).
  5. We've deprecated the test/train validation; this is explained here.
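
To illustrate point 3, a rough sketch of a yearly ROI calculation. The model ID, the channel column tv_S, and the date columns ds/DATE are assumptions to adapt to your own data:

```r
library(dplyr)
library(lubridate)

select_model <- "1_105_6"  # hypothetical ID; pick a Pareto-optimal model from your output

# Decomposed effect per year for one channel (column name "tv_S" is an assumption)
effect_by_year <- OutputCollect$xDecompVecCollect %>%
  filter(solID == select_model) %>%
  mutate(year = year(ds)) %>%
  group_by(year) %>%
  summarise(effect = sum(tv_S))

# Raw spend per year from your original input data
spend_by_year <- dt_input %>%
  mutate(year = year(DATE)) %>%
  group_by(year) %>%
  summarise(spend = sum(tv_S))

# Yearly ROI = decomposed effect / spend
left_join(effect_by_year, spend_by_year, by = "year") %>%
  mutate(roi = effect / spend)
```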

@RaviP1987
Author

RaviP1987 commented Nov 11, 2021

Hi Gufeng,

Thanks a lot for your explanations. A few final thoughts/questions:

  1. On the response curve, just wanted to do a quick sense check with you: if I change the graph to plot 'paid_variables' on the x-axis instead of 'spend', I assume it won't impact the rest of the codebase. Please correct me if I am wrong?
  2. I understand p-values are meaningless in ridge regression, but my intent is to have some measure of autocorrelation and multicollinearity in the model. (I am sure the values here would be fine, but I just want the data to prove it, in case this question is raised by my organisation.) Please advise what options I have.
  3. Lastly, I wanted to check if it's possible to use Excel/CSV files to load data (instead of RData files)? Sorry if you have answered this earlier; it would be great if you could point me to the right thread.

Thanks so much again!!

Regards
Ravi

@gufengzhou
Contributor

Hey, sorry I missed your last questions:

  1. Yes, I believe you can adapt the x-axis to something other than spend.
  2. I think the Durbin-Watson score might still be useful, because it's calculated from the model residuals (see the sketch after this list). For multicollinearity, I don't think VIF will give you proper insight, because that's something you compute before the fitting. In fact, regularisation doesn't "eliminate" multicollinearity. When high multicollinearity is present, meaning some columns of X can be represented as linear combinations of other columns, the X'X matrix is singular and thus not invertible, which is the issue when solving the normal equations. By adding the regularisation term, the X'X matrix becomes regular and thus invertible. In summary, the inter-correlation between indep_vars is still present, but we can solve the system nonetheless with regularisation.
  3. Yes, you can totally read CSV files as input. To avoid potential errors, use data.table::fread() instead of read.csv().
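
On point 2, a rough sketch of computing Durbin-Watson yourself from the residuals of a selected model. The column names dep_var and depVarHat are assumptions, so check OutputCollect$xDecompVecCollect in your version:

```r
# Residuals = actual minus fitted for one Pareto-optimal model
fitted_df <- subset(OutputCollect$xDecompVecCollect, solID == select_model)
e <- fitted_df$dep_var - fitted_df$depVarHat

# Durbin-Watson statistic: values near 2 suggest no first-order autocorrelation
dw <- sum(diff(e)^2) / sum(e^2)
dw
```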

@RaviP1987
Author

Thanks so much, Gufeng! I have one more question for your advice (if I may).

I have a variable (in-store display) which isn't really media and doesn't need adstock. If I put it as a context variable, the model output shows a very high contribution (despite displays being present in only 10% of our store universe). I know that decomp.rssd is the way to avoid this error, but it is applicable only to 'paid media' in the config file (and I don't want to put 'display' in paid media, given adstock isn't applicable).

Can you please advise how I can solve this dilemma? Is there any other way I can specify the costs of 'displays' that allows decomp.rssd to pick it up without using an adstock factor?

Thanks so much for your help

@gufengzhou
Contributor

No worries. You can use this in-store display as paid_media_vars normally and just restrict the adstock parameter(s) to very narrow bounds, or even fix them to 0. E.g. for geometric adstock, you can set in_store_display_thetas = c(0, 0.01) for a narrow low range, or fix it with in_store_display_thetas = 0 in the hyperparameters. For Weibull, set both shape & scale to a number close to 0, like in_store_display_shapes = 0.0001 and in_store_display_scales = 0.0001, to get a decay vector that is negligible.
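
Putting that together, a sketch of the hyperparameter list for the geometric case; the other channel names and ranges are placeholders for your own setup:

```r
hyperparameters <- list(
  in_store_display_alphas = c(0.5, 3),    # Hill saturation: keep standard range
  in_store_display_gammas = c(0.3, 1),
  in_store_display_thetas = c(0, 0.0001), # near-zero carryover; or fix with = 0
  tv_S_alphas = c(0.5, 3),                # your other media keep normal bounds
  tv_S_gammas = c(0.3, 1),
  tv_S_thetas = c(0.1, 0.4)
)
```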

@RaviP1987
Author

Thanks very much, Gufeng! Much appreciated. On this, I assume the Weibull alpha and gamma don't need any change to avoid adstock. Please correct me if I am wrong. Thank you.

@gufengzhou
Contributor

Alpha and gamma are saturation parameters. I think it's reasonable to expect saturation for any paid media, so I'd recommend keeping the standard hyperparameter ranges.

@RaviP1987
Author

Thanks, Gufeng. Actually, for displays there shouldn't be any saturation, because this variable just counts the # of stores with a display; every new store with a display should add incremental sales. In this scenario, what alpha/gamma values do you recommend to ensure saturation isn't considered by the algorithm? Thank you.

@gufengzhou
Contributor

Ah, OK, then this isn't a media variable at all; it should indeed go into context_vars. What dep_var do you have? It seems like the number of stores with display correlates with your dep_var. Assuming the causal relationship makes sense, meaning "adding stores with display drives sales", then the large contribution of this variable confirms it. The fact that it's only 10% is irrelevant, because it's about the relationship "the more of A, the more of B". I don't know how large the contribution is, but it's quite usual in offline business that in-store sales tactics are very effective.

For now, Robyn doesn't have many features to "tame" context vars. You can try to run more iterations, allow wider hyperparameters for your media, and output more pareto_fronts to look for models that have stronger media decomposition.
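As an illustration, a sketch of bumping the search budget and exporting more Pareto fronts; the argument names follow the demo script, so verify them against your installed Robyn version:

```r
# More search budget: more iterations per trial and more independent trials
OutputModels <- robyn_run(
  InputCollect = InputCollect,
  iterations = 4000,
  trials = 10
)

# Export additional Pareto fronts to widen the pool of candidate models
OutputCollect <- robyn_outputs(
  InputCollect, OutputModels,
  pareto_fronts = 3
)
```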

@RaviP1987
Author

Thanks very much, Gufeng! This is a helpful perspective!! Really appreciate it! :)
