Why doesn't IIDBootstrap support different sample sizes? #260

Closed
yanirs opened this Issue Dec 19, 2018 · 13 comments

yanirs commented Dec 19, 2018

Minimal example to show the current behaviour:

In [1]: from arch.bootstrap import IIDBootstrap

In [2]: IIDBootstrap([1, 2, 3], [4, 5])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-2-46a2c8a96fe3> in <module>()
----> 1 IIDBootstrap([1, 2, 3], [4, 5])

~/projects/automattic/data-science/conda_env/lib/python3.6/site-packages/arch/bootstrap/base.py in __init__(self, *args, **kwargs)
    155         for arg in all_args:
    156             if len(arg) != self._num_items:
--> 157                 raise ValueError("All inputs must have the same number of "
    158                                  "elements in axis 0")
    159         self._index = np.arange(self._num_items)

ValueError: All inputs must have the same number of elements in axis 0

With real-life data, sample sizes of different groups often vary. What is the reason for not supporting different sample sizes in IIDBootstrap? Is there a workaround?

@bashtage

Owner

bashtage commented Dec 19, 2018

@yanirs

yanirs commented Dec 19, 2018

Thanks for the quick reply! How would the use of itertools.product() help with calculating the difference in means with uneven sample sizes, though (for example)?

@bashtage

Owner

bashtage commented Dec 19, 2018

@yanirs

yanirs commented Dec 19, 2018

Sure, here's a more concrete example: Let's say I have a website that sells a small number of products. I run an experiment on 1,000 users, where 800 users receive a limited-time discount offer and 200 users don't receive the offer. I'd like to see how the offer affects short-term spending behaviour, e.g., on the week following the offer. So for each user, I measure how much they spent that week. I then end up with two vectors of observed spending figures, the first of length 800 and the second of length 200. I'd like to estimate the confidence interval for the difference of mean user spending between the two groups. I'd also like to get the confidence interval for the ratio of means, and possibly other metrics. Is that possible with IIDBootstrap?

We can't assume that the spending vectors come from normal distributions, as there are many zeroes and a small number of differently-priced products. I'm aware that the mean isn't the best way to characterise such distributions, but average revenue per user is a common business metric, so I'd still like to provide a good answer about the likely range of the difference/ratio of means between the groups. I'd like to use the BCA method for the confidence intervals, which is why I started looking at this package, as it seems like the only Python bootstrapping package that's well-maintained. 🙂

@bashtage

Owner

bashtage commented Dec 20, 2018

I understand now. You have uneven samples that are independent. You can rewrite your problem as a linear regression, and then bootstrap the regression coefficient. To do this you need to construct a regressor matrix that has one column of 1s and a second column that is 1 if the subject got the treatment and 0 if not. The model you estimate is y = a + b * t, where t is the treatment dummy. If b is different from 0, then the treatment has an effect. To implement a bootstrap, simply construct the x and y variables as

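import numpy as np

# y_no, y_yes: 1-D arrays of outcomes for the untreated and treated groups
y = np.r_[y_no, y_yes]
n_no = y_no.shape[0]
nobs = y.shape[0]
# Column 0 is the constant; column 1 is the treatment dummy
x = np.ones((nobs, 2))
x[:n_no, 1] = 0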

and then you can use IIDBootstrap(y, x) along with a regression estimator to perform bootstrap inference.
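As a rough illustration of the regression idea, the sketch below resamples (y, t) rows jointly and refits y = a + b * t on each draw. It is a hand-rolled, standard-library version and does not use arch; the function names `ols_dummy` and `pairs_bootstrap_slope`, the spending figures, and the percentile interval are all assumptions made for the example.

```python
import random
from statistics import mean

def ols_dummy(y, t):
    """Return (a, b) for y = a + b*t fitted by least squares, with t in {0, 1}."""
    t_bar, y_bar = mean(t), mean(y)
    var_t = sum((ti - t_bar) ** 2 for ti in t)
    if var_t == 0:
        raise ValueError("sample contains only one group")
    b = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y)) / var_t
    return y_bar - b * t_bar, b

def pairs_bootstrap_slope(y, t, reps=1000, seed=0):
    """Resample (y, t) rows jointly and return the sorted bootstrap slopes."""
    rng = random.Random(seed)
    n = len(y)
    slopes = []
    while len(slopes) < reps:
        idx = [rng.randrange(n) for _ in range(n)]
        ts = [t[i] for i in idx]
        if 0 < sum(ts) < n:  # skip degenerate draws that contain a single group
            slopes.append(ols_dummy([y[i] for i in idx], ts)[1])
    return sorted(slopes)

# Made-up spending data: 6 untreated users, 10 treated users
y_no = [0.0, 0.0, 5.0, 0.0, 12.0, 0.0]
y_yes = [0.0, 8.0, 0.0, 20.0, 5.0, 0.0, 0.0, 12.0, 5.0, 0.0]
y = y_no + y_yes
t = [0] * len(y_no) + [1] * len(y_yes)

a, b = ols_dummy(y, t)
slopes = pairs_bootstrap_slope(y, t, reps=2000)
# 95% percentile interval for the treatment coefficient b
lower = slopes[int(0.025 * len(slopes))]
upper = slopes[int(0.975 * len(slopes)) - 1]
```

On the full sample, the fitted b equals the difference between the treated and untreated means, which is why bootstrapping the coefficient answers the difference-in-means question.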

@bashtage

Owner

bashtage commented Dec 20, 2018

Instructions on how to implement the bootstrap are here:

https://arch.readthedocs.io/en/latest/bootstrap/semiparametric-parametric-bootstrap.html

@yanirs

yanirs commented Dec 21, 2018

Thank you! I think I get it. After fitting the regression model, y = a is the estimate for the no-treatment group and y = a + b is the estimate for the treatment group, so we can use the confidence intervals for a and a + b, right?
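For reference, the least-squares algebra behind this reading can be written out as below. The notation is an assumption introduced here: $t_i$ is the treatment dummy, and $\bar{y}_{\mathrm{no}}$ and $\bar{y}_{\mathrm{yes}}$ are the sample means of the untreated and treated groups.

```latex
% Normal equations for minimizing \sum_i (y_i - a - b t_i)^2 with t_i \in \{0, 1\}
\sum_i (y_i - \hat{a} - \hat{b} t_i) = 0,
\qquad
\sum_i t_i (y_i - \hat{a} - \hat{b} t_i) = 0.

% The second equation only involves treated units (t_i = 1):
\sum_{i : t_i = 1} (y_i - \hat{a} - \hat{b}) = 0
\;\Longrightarrow\;
\hat{a} + \hat{b} = \bar{y}_{\mathrm{yes}}.

% Subtracting it from the first leaves the untreated units (t_i = 0):
\sum_{i : t_i = 0} (y_i - \hat{a}) = 0
\;\Longrightarrow\;
\hat{a} = \bar{y}_{\mathrm{no}},
\qquad
\hat{b} = \bar{y}_{\mathrm{yes}} - \bar{y}_{\mathrm{no}}.
```

So the point estimates for a and a + b are the two group means, and b itself is the difference in means.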

Is there anything methodologically wrong about doing it without a regression model, just by sampling from the groups separately and then calculating the metric of interest, as in the first example here? That's also the approach taken by the bootstrapped package (which isn't very well-maintained).

I tried running the example from the semiparametric bootstrap page but got the following error. Am I missing something or is there a bug in the example?

In [1]: import numpy as np 
   ...: def ols(y, x, params=None, x_orig=None): 
   ...:     if params is None: 
   ...:         return np.linalg.pinv(x).dot(y) 
   ...:  
   ...:     # When params is not None 
   ...:     # Bootstrap residuals 
   ...:     resids = y - x.dot(params) 
   ...:     # Simulated data 
   ...:     y_star = x_orig.dot(params) + resids 
   ...:     # Parameter estimates 
   ...:     return np.linalg.pinv(x_orig).dot(y_star) 
   ...:

In [2]: from arch.bootstrap import IIDBootstrap 
   ...: x = np.random.randn(100,3) 
   ...: e = np.random.randn(100,1) 
   ...: b = np.arange(1,4) 
   ...: y = x.dot(b) + e 
   ...: bs = IIDBootstrap(y, x) 
   ...: ci = bs.conf_int(ols, 1000, method='percentile', 
   ...:                  sampling='semi', extra_kwargs={'x_orig': x})

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-2-035af9d7ddc8> in <module>
      6 bs = IIDBootstrap(y, x)
      7 ci = bs.conf_int(ols, 1000, method='percentile',
----> 8                  sampling='semi', extra_kwargs={'x_orig': x})

~/miniconda3/envs/tmp-env/lib/python3.7/site-packages/arch/bootstrap/base.py in conf_int(self, func, reps, method, size, tail, extra_kwargs, reuse, sampling, std_err_func, studentize_reps)
    419                                                 std_err_func=std_err_func,
    420                                                 studentize_reps=studentize_reps,  # noqa
--> 421                                                 sampling=sampling)
    422 
    423         base, results = self._base, self._results

~/miniconda3/envs/tmp-env/lib/python3.7/site-packages/arch/bootstrap/base.py in _construct_bootstrap_estimates(self, func, reps, extra_kwargs, std_err_func, studentize_reps, sampling)
    623             elif semi:
    624                 kwargs['params'] = base
--> 625             results[count] = func(*pos_data, **kwargs)
    626             if std_err_func is not None:
    627                 std_err = std_err_func(results[count], *pos_data, **kwargs)

ValueError: could not broadcast input array from shape (3,100) into shape (3)
@bashtage

Owner

bashtage commented Dec 21, 2018

Some NumPy changes produced some subtle bugs. The docs have been fixed now.

The up-to-date copy is at:

http://bashtage.github.io/arch/doc/bootstrap/semiparametric-parametric-bootstrap.html

@yanirs

yanirs commented Jan 4, 2019

Is there anything methodologically wrong about doing it without a regression model, just by sampling from the groups separately and then calculating the metric of interest, as in the first example here? That's also the approach taken by the bootstrapped package (which isn't very well-maintained).

Happy new year, @bashtage! I'm still curious about the above question. I recently read this paper, which mentions the uneven sample size problem:

The general rule is to sample in the same way the data were drawn, except to condition on the observed information, and any constraints. For example, when comparing samples of size n1 and n2, we fix those numbers and do a two-sample bootstrap with sizes n1 and n2, even if the original sampling procedure could have produced different counts.

Would it be possible to extend IIDBootstrap to support such sampling rather than relying on regression?

@bashtage

Owner

bashtage commented Jan 4, 2019

Is there anything methodologically wrong about doing it without a regression model, just by sampling from the groups separately and then calculating the metric of interest, as in the first example here? That's also the approach taken by the bootstrapped package (which isn't very well-maintained).

There is nothing wrong with this approach. The two approaches are slightly different. If you have 200 in group A and 800 in group B and you do a bootstrap where you resample 200 from A and 800 from B, then you will always have 200 and 800. When you use the regression approach and resample from the joint set of 1000, you will have some variance in the number from A and B. A will, on average, be 20% of the sample but may be 22% or 18% in a particular sample.
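The difference between the two schemes can be seen in a small standard-library sketch. Everything here is illustrative: `two_sample_bootstrap` and `joint_group_counts` are hypothetical helpers (not part of arch), the spending data are simulated, and a plain percentile interval stands in for BCA.

```python
import random
from statistics import mean

def two_sample_bootstrap(a, b, stat, reps=1000, seed=0):
    """Resample each group on its own, so the sizes stay len(a) and len(b)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(reps):
        res_a = rng.choices(a, k=len(a))
        res_b = rng.choices(b, k=len(b))
        draws.append(stat(res_a, res_b))
    return sorted(draws)

def joint_group_counts(n_a, n_b, reps=5, seed=0):
    """Resample pooled group labels and report how many fall in each group."""
    rng = random.Random(seed)
    labels = ["A"] * n_a + ["B"] * n_b
    counts = []
    for _ in range(reps):
        draw = rng.choices(labels, k=n_a + n_b)
        counts.append((draw.count("A"), draw.count("B")))
    return counts

# Simulated spending: many users spend nothing, the rest buy one product
data_rng = random.Random(1)
group_a = [data_rng.choice([0, 0, 0, 5, 12]) for _ in range(200)]
group_b = [data_rng.choice([0, 0, 5, 12, 20]) for _ in range(800)]

# Two-sample bootstrap of the difference in means, with fixed sizes 200 and 800
diffs = two_sample_bootstrap(group_a, group_b, lambda a, b: mean(b) - mean(a))
ci_diff = (diffs[int(0.025 * len(diffs))], diffs[int(0.975 * len(diffs)) - 1])

# Joint resampling: each pair sums to 1000, but the A/B split varies by draw
counts = joint_group_counts(200, 800)
```

A ratio of means could be passed as `stat` in the same way, provided the denominator group's resampled mean cannot be zero.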

Would it be possible to extend IIDBootstrap to support such sampling rather than relying on regression?

Not easily from what I can see. The entire infrastructure is designed to

  1. Resample the data in a single pass so that n data points become n resampled data points
  2. Apply the statistic to the observations of the resampled data
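The two steps above can be sketched in a few lines. This is a simplified standard-library illustration of the single-pass design, not arch's actual implementation; `resample_shared_index` and `bootstrap` are hypothetical names. Because one index of length n is applied to every input, all inputs must share the same length, which is the condition behind the ValueError in the original report.

```python
import random

def resample_shared_index(arrays, rng):
    """Draw one index of length n and apply it to every input array."""
    n = len(arrays[0])
    if any(len(arr) != n for arr in arrays):
        raise ValueError("All inputs must have the same number of elements")
    idx = [rng.randrange(n) for _ in range(n)]
    return [[arr[i] for i in idx] for arr in arrays]

def bootstrap(stat, arrays, reps, seed=0):
    """Step 1: resample in a single pass. Step 2: apply the statistic."""
    rng = random.Random(seed)
    return [stat(*resample_shared_index(arrays, rng)) for _ in range(reps)]

# Paired inputs stay aligned row by row after resampling
y = [10, 20, 30, 40]
x = [1, 2, 3, 4]
y_star, x_star = resample_shared_index([y, x], random.Random(0))

# A statistic of the paired data, evaluated on 100 resamples
ratios = bootstrap(lambda ys, xs: sum(ys) / sum(xs), [y, x], reps=100)
```

Sharing the index keeps each row of y matched with its row of x, which suits regression-style inputs but leaves no room for two independent samples of different sizes.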

For simple applications, such as the examples you pointed to above, this is easy to do. For more sophisticated methods, e.g., BCA, it is more difficult to implement since you have to compute additional quantities (for BCA, a bias correction and an acceleration term).

I suspect one could write a bootstrap called TwoSampleIIDBootstrap with a different signature that would work.

@yanirs

yanirs commented Jan 5, 2019

Thanks for the detailed reply! I'm closing the issue as this answers my questions. 🙂

@yanirs yanirs closed this Jan 5, 2019

@bashtage

Owner

bashtage commented Jan 7, 2019

@yanirs I thought about this a bit more and it was easy to add a new bootstrap that handles exactly this case. It is still a PR (#267) but should be merged soon. The class is IndependentSamplesBootstrap.

@yanirs

yanirs commented Jan 7, 2019

@bashtage That's great, thank you! 🙂
