
Conversation

@nmdefries
Contributor

@nmdefries nmdefries commented May 26, 2021

Description

Documented and actual calculations disagree on use of weights in binary.R standard errors. Define and use effective sample size (corresponding to respondent weights) instead of raw sample size (corresponding to uniform weights).

Changelog

  • Define effective sample size as 1/sum(w_i^2), which equals the raw sample size when using uniform weights; see the sketch after this list.
  • Use effective sample size instead of raw sample size in SE calculation.
  • Use effective sample size when adjusting binary proportions with Jeffreys prior.
  • Use effective sample size when calculating binary SEs.
  • Add tests to check that implementation matches definition as given in our documentation.
  • Update gold_receiving files.
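As a minimal sketch of the first changelog item (illustrative only, not the package code):

```r
# Effective sample size from respondent weights, assuming the weights are
# first normalized to sum to 1 (as in the definition above).
effective_sample_size <- function(weights) {
  w <- weights / sum(weights)
  1 / sum(w^2)
}

effective_sample_size(rep(1, 10))     # uniform weights: equals the raw n, 10
effective_sample_size(c(5, 1, 1, 1))  # skewed weights: about 2.3, well below the raw n of 4
```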

Fixes

Code changes for #1271.

@nmdefries nmdefries marked this pull request as draft May 26, 2021 17:22
@nmdefries nmdefries changed the title from "[fb-survey] Use effective sample size to calculate binary SEs" to "[fb-survey] Use effective sample size for binary SEs" May 26, 2021
@nmdefries nmdefries marked this pull request as ready for review October 19, 2021 14:47
@capnrefsmmat
Contributor

I think we definitely need a couple of unit tests for SE calculations using weighted data, maybe by making up five or ten weights and comparing the result to our documentation. Without that, I'll be nervous.
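Something along these lines, for instance (a self-contained sketch of the kind of check being requested; weighted_se is a stand-in defined here, not the actual package function):

```r
library(testthat)

# Stand-in for the documented weighted-SE calculation: normalize the weights
# to sum to 1, then take sqrt(sum(w_i^2 * (y_i - mean)^2)).
weighted_se <- function(y, w) {
  w <- w / sum(w)
  mu <- sum(w * y)
  sqrt(sum(w^2 * (y - mu)^2))
}

test_that("weighted SE reduces to sqrt(p * (1 - p) / n) under uniform weights", {
  y <- c(1, 0, 1, 1, 0, 0, 1, 0, 1, 1)
  n <- length(y)
  p <- mean(y)
  expect_equal(weighted_se(y, rep(1, n)), sqrt(p * (1 - p) / n))
})
```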

Also, does this PR only change binary indicators, and not the CLI/ILI indicators? Those will also need correct weighting.

@nmdefries
Contributor Author

nmdefries commented Oct 22, 2021

Sure, I'll add some tests.

CLI/ILI should already match the documentation. Here's the count indicators code for effective sample size and initial SE, and for adjusted SE.

I believe it's only binary indicators that haven't been incorporating weights properly. Let me know if I'm misunderstanding something.

@nmdefries
Contributor Author

nmdefries commented Oct 27, 2021

Discovered some things.

About the docs

About the code

  • binary.R::jeffreys_percentage should probably use effective_sample_size when adding the Jeffreys adjustment. It's currently using plain sample_size. This means that previous proportions will have been calculated differently (incorrectly?). Not sure if it's wrong, or if both ways are technically valid and it's the user's choice. However, count indicators do use effective sample size when applying the Jeffreys prior.
  • The formula for adjusted SE for count indicators in the code isn't exact. Out of (1 + 2 n_eff + n_eff^2) * old_se^2, the first two terms appear to have been dropped (since they're much smaller than the third term), which means that it is close to but doesn't exactly agree with the docs.
  • The formula for SE for binary indicators also appears to be approximate. It's not clear to me that the p * (1-p) / n formula generalizes to the weighted case. Indeed, the documented formula and the in-code formula give increasingly different results the farther the weights are from uniform.

These last two points are causing the test failures, although the absolute difference between the implemented and from-the-docs SEs is smallish, O(0.001-0.01); a quick comparison is sketched below.
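For example (a rough, self-contained sketch, not the package code; "documented" and "in-code" here follow the description in the last bullet):

```r
set.seed(42)
n <- 500
y <- rbinom(n, 1, 0.3)
w <- runif(n)
w <- w / sum(w)                            # non-uniform weights, normalized to sum to 1

p_hat   <- sum(w * y)
se_doc  <- sqrt(sum(w^2 * (y - p_hat)^2))  # documented weighted formula
se_code <- sqrt(p_hat * (1 - p_hat) / n)   # unweighted-style p * (1-p) / n formula

se_doc - se_code  # small but nonzero; the gap grows as the weights get less uniform
```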

@capnrefsmmat
Contributor

I'll set aside the documentation points for now, since we should figure out how to fix the code and then ensure all documentation matches it.

binary.R::jeffreys_percentage should probably use effective_sample_size when adding the Jeffreys adjustment. It's currently using plain sample_size. This means that previous proportions will have been calculated differently (incorrectly?). Not sure if it's wrong, or if both ways are technically valid and it's the user's choice. However, count indicators do use effective sample size when applying the Jeffreys prior.

It's more than that: compute_binary_response does not attempt to generate an effective sample size at all, and just uses the sample size. So the proportions, Jeffreys adjustments, and SEs all match: they don't use effective sample size.

The formula for adjusted SE for count indicators in the code isn't exact. Out of (1 + 2 n_eff + n_eff^2) * old_se^2, the first two terms appear to have been dropped (since they're much smaller than the third term), which means that it is close to but doesn't exactly agree with the docs.

Your link points to jeffreys_se, but the main SE calculation for count indicators is in compute_count_response. jeffreys_se is intended to adjust that standard error to be calculated as if the proportion had the Jeffreys adjustment applied. I believe that calculation correctly matches the documentation.

The formula for SE for binary indicators also appears to be approximate. It's not clear to me that the p * (1-p) / n formula generalizes to the weighted case. Indeed, the documented formula and the in-code formula give increasingly different results the farther the weights are from uniform.

Yes, I think you're right here, and the code needs to match the documentation.

@nmdefries
Contributor Author

nmdefries commented Oct 28, 2021

binary.R::jeffreys_percentage should probably use effective_sample_size when adding the Jeffreys adjustment. It's currently using plain sample_size. This means that previous proportions will have been calculated differently (incorrectly?). Not sure if it's wrong, or if both ways are technically valid and it's the user's choice. However, count indicators do use effective sample size when applying the Jeffreys prior.

It's more than that: compute_binary_response does not attempt to generate an effective sample size at all, and just uses the sample size.

I agree, but I've already fixed that particular aspect: I've defined effective_sample_size using respondent weights, so that where it is used, it's the correct value. But effective_sample_size is not being used in binary.R::jeffreys_percentage. At the moment, we're adjusting binary estimates like

(percentage * sample_size + 50) / (sample_size + 1)

regardless of whether the estimate is weighted or unweighted. But we should be using (the correct value for) effective sample size instead, right?

The implication of this is that historical percentages for binary indicators are wrong. Perhaps we should delete them along with SEs.
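Concretely, the two variants under discussion look like this (a sketch built from the formula above; the function names and numbers are made up, and neither function is copied from binary.R):

```r
# Current behaviour: adjust using the raw sample size.
jeffreys_raw <- function(percentage, sample_size) {
  (percentage * sample_size + 50) / (sample_size + 1)
}

# Proposed behaviour: same formula, but with the effective sample size.
jeffreys_eff <- function(percentage, effective_sample_size) {
  (percentage * effective_sample_size + 50) / (effective_sample_size + 1)
}

# E.g. 1000 respondents whose weights give an effective sample size of 200:
jeffreys_raw(0, 1000)  # about 0.05
jeffreys_eff(0, 200)   # about 0.25, a noticeably larger pull away from zero
```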

The formula for adjusted SE for count indicators in the code isn't exact. Out of (1 + 2 n_eff + n_eff^2) * old_se^2, the first two terms appear to have been dropped (since they're much smaller than the third term), which means that it is close to but doesn't exactly agree with the docs.

Your link points to jeffreys_se, but the main SE calculation for count indicators is in compute_count_response. jeffreys_se is intended to adjust that standard error to be calculated as if the proportion had the Jeffreys adjustment applied. I believe that calculation correctly matches the documentation.

The unadjusted formula looks fine; it's the adjustment formula that is inexact. Let me write out the algebra.

So starting from the documentation for SE adjustment (minus the typo),

[screenshot of the documented SE adjustment formula]

But the implemented formula is [screenshot of the implemented formula, not preserved].

So there are these two missing terms in the square root, 1 * s_p^2 and 2 * n_eff * s_p^2, which appear to have been dropped because they're small relative to n_eff^2 * s_p^2. It's not wrong, but it's no harder to calculate the exact value, and it doesn't match the documentation. So we can change the adjustment formula to be exact, or mention the approximation in the docs. Or I guess decide it's close enough not to matter?
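For a sense of scale (plain arithmetic on the terms named above; n_eff = 250 is just an illustrative value):

```r
n_eff   <- 250
kept    <- n_eff^2          # coefficient of s_p^2 that the implementation keeps
dropped <- 1 + 2 * n_eff    # coefficients of s_p^2 that it drops
dropped / (kept + dropped)  # about 0.008, i.e. under 1% of the quantity inside the square root
```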

The formula for SE for binary indicators also appears to be approximate. It's not clear to me that the p * (1-p) / n formula generalizes to the weighted case.

Yes, I think you're right here, and the code needs to match the documentation.

👍

@capnrefsmmat
Contributor

I agree, but I've already fixed that particular aspect: I've defined effective_sample_size using respondent weights, so that where it is used, it's the correct value. But effective_sample_size is not being used in binary.R::jeffreys_percentage. At the moment, we're adjusting binary estimates like

(percentage * sample_size + 50) / (sample_size + 1)

regardless of whether the estimate is weighted or unweighted. But we should be using (the correct value for) effective sample size instead, right?

The implication of this is that historical percentages for binary indicators are wrong. Perhaps we should delete them along with SEs.

Well... in principle, the Jeffreys adjustment is a hack. We're taking the actual proportion and deliberately fudging it so it can't be 0. A Jeffreys prior is one such way to fudge it, and can be justified by appeal to a Bayesian inference procedure, but there are others. The approach we're using matches the Jeffreys prior in the unweighted case but not in the weighted case. How bad is that?

Not too bad, I don't think, as long as the difference between the two approaches is not large and we document it. In fact this current approach does have one advantage: the user can undo it using data in the API, so they can see the actual sample proportion.
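That inversion is just the adjustment formula quoted above run backwards (a sketch; the function name and values are illustrative):

```r
# If adjusted = (percentage * sample_size + 50) / (sample_size + 1),
# the raw sample percentage can be recovered from the reported values:
undo_jeffreys <- function(adjusted, sample_size) {
  (adjusted * (sample_size + 1) - 50) / sample_size
}

adj <- (30 * 200 + 50) / (200 + 1)  # forward adjustment of a 30% estimate, n = 200
undo_jeffreys(adj, 200)             # recovers 30
```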

The unadjusted formula looks fine; it's the adjustment formula that is inexact. Let me write out the algebra.

So starting from the documentation for SE adjustment (minus the typo),

[screenshot of the documented SE adjustment formula]

But the implemented formula is [screenshot of the implemented formula, not preserved].

So there are these two missing terms in the square root, 1 * s_p^2 and 2 * n_eff * s_p^2, which appear to have been dropped because they're small relative to n_eff^2 * s_p^2. It's not wrong, but it's no harder to calculate the exact value, and it doesn't match the documentation. So we can change the adjustment formula to be exact, or mention the approximation in the docs. Or I guess decide it's close enough not to matter?

Hmmm. How did you arrive at the conclusion that the documented version should not have that n_e? Note that n_e in the documentation is one over the sum of the squared weights, and not just the sum of the weights.

The original derivation is documented here: https://github.com/cmu-delphi/covid-19/blob/deeb4dc1e9a30622b415361ef6b99198e77d2a94/facebook/prepare-extracts/aggregations-setup.R#L210-L247

It uses slightly different notation, so let me do it here. Suppose w_i is the weight for observation i, after normalization so they sum to 1. Let y_i be the response for observation i and let \hat \mu be the mean over all observations. Then we have that

se_hat^2 = sum(w_i^2 * (y_i - mu_hat)^2)

Now we add a "fake" observation with value 1/2. Working on both sides,

[rendered equation not preserved]

Now we solve for the new standard error:

[rendered equation not preserved]

Now we see the outline of what's implemented in jeffreys_se, since the weight normalization means that the sum of the weights is the sample size.

The remaining question is whether jeffreys_se is using the correct inputs to achieve the desired results. First, it's using sample_size instead of the sum of the weights. As it's written now, sample_size is actually the effective sample size produced in compute_count_response, i.e.

[rendered equation not preserved]

where n_e is defined as in the documentation. Second, it's using

[rendered equation not preserved]

as the "old se" instead of the one defined above that correctly accounts for the weights. If we substitute these choices in, what do we get?

I dunno, I'm tired, let me finish this later. But evidently I've gone about this a different way than you did. Any idea where we diverged specifically, so we can figure out what's wrong before I finish this off?

@nmdefries
Contributor Author

nmdefries commented Nov 1, 2021

The approach we're using matches the Jeffreys prior in the unweighted case but not in the weighted case. How bad is that?

Not too bad, I don't think, as long as the difference between the two approaches is not large and we document it. In fact this current approach does have one advantage: the user can undo it using data in the API, so they can see the actual sample proportion.

The difference is small (something like O(0.01)) on test data using weights drawn from a uniform distribution; haven't compared with real data.

But we are applying the Jeffreys prior using effective sample size for count indicators vs plain sample size for binary indicators, so there's some inconsistency there. Is using raw sample size outright wrong in the weighted case, or is it the user's choice, and either raw or weighted sample size would be valid (since the adjustment is kind of hacky, like you said)? Are we justified in using plain sample size?

The original derivation is documented here: https://github.com/cmu-delphi/covid-19/blob/deeb4dc1e9a30622b415361ef6b99198e77d2a94/facebook/prepare-extracts/aggregations-setup.R#L210-L247

Thanks for linking the derivation, that's helpful. I didn't realize there were different definitions of n_eff floating around.

RE what I think is a typo,

[screenshot of the documented formula with the suspected typo]

Adding more in a separate comment on the derivation you linked.

@capnrefsmmat
Contributor

Okay, I just sent you an invite to an Overleaf document (basically Google Docs for LaTeX, if you haven't used it before) so we can work this out there instead of sharing screenshots of LaTeX snippets. I've written out most of the derivation, with some corrections made for typos and such.

So far I'm leaning in the direction that you're right and the code has a typo. But I'd like to write out the derivation in the code in detail and see how, precisely, the two versions differ and where we might have gone wrong in the code; that seems prudent when we don't quite understand how the code got this way, since maybe it's smarter than we are.

@nmdefries
Contributor Author

Thanks for setting up the doc. I added a derivation in the same vein as the one you linked but assuming that the weights are normalized.

@capnrefsmmat
Contributor

I've updated the Overleaf: I think the difference between code and documentation can be reduced to a simple difference (w0 = 1/neff vs 1/(1 + neff)). Does that look right to you?

If so, then we can proceed to deciding which one is more correct.
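For scale: the two candidate weights are numerically very close at realistic effective sample sizes, which is part of why the discrepancy is subtle (illustrative value only):

```r
n_eff <- 250
c(one_over_neff = 1 / n_eff, one_over_neff_plus_one = 1 / (1 + n_eff))
# 0.004000 vs 0.003984
```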

@nmdefries
Contributor Author

Yep, looks good!

@capnrefsmmat
Contributor

Made a further update to reorganize the document so it's easy to compare the approaches and understand them. I think that helped me realize a further issue in the documented version that explains why it's different, beyond the choice of w_0. I've added extra explanation.

I think the conclusion is that the code is correct and the documentation is wrong. Your definition followed the documentation and started at equation (30), but the code started at equation (19), which is more correct. Or at least it's more correct if you want to add the extra fake weight to both numerator and denominator. Let me know if you agree.

@nmdefries
Contributor Author

Yes, I agree with your conclusion. The discussion of what w_0 should be in the unweighted case is particularly compelling in support of the implemented definition.

The derivations in question, for documentation purposes.

@nmdefries
Contributor Author

I sent you an invite to another Overleaf doc dealing with the binary case.

Transform p̂: recalculate it as if there were one additional observation i = 0 with Y_0 = 1/2, then calculate the standard error according to the equation above while including that extra observation in the sum.

We can't do this directly using the explicit sum(w_i^2 * (Y_i - p_hat)^2), since the lists of weights and responses aren't available in the recalculation/adjustment function. More detail in the doc.

@capnrefsmmat
Contributor

Thanks, the document looks good. I think we can go ahead and implement option 2, and prepare a separate PR to fix the documentation for both binary and count indicators.

@capnrefsmmat
Contributor

And if I understand correctly, all this work led to the conclusion "This PR is correct"? So we just need to prepare the documentation?

@krivard
Contributor

krivard commented Nov 11, 2021

(fix the tests)

clarify comment

add uniform tests, increase sample size, correct counts formula

update tests

new test names
@nmdefries nmdefries force-pushed the fb-package-weight-binary-ses branch from edaa3bf to f13f9c7 on November 12, 2021 17:47
@nmdefries
Contributor Author

And derivation/discussion for binary indicators

This PR isn't quite done, I still need to repair the integration tests.

@nmdefries
Contributor Author

@capnrefsmmat This is ready to review.

Contributor

@capnrefsmmat capnrefsmmat left a comment


Looking good. Let's plan to deploy this the week of Nov 30, and figure out a backfill plan with Katie to begin that week.

@nmdefries
Contributor Author

Wait for #1397 to be merged.

@nmdefries
Contributor Author

@krivard This is ready to merge. Since the CTIS API and contingency table pipelines are turned off at this point, this will be used for regenerating/backfilling data with the new, corrected SE definition.

@krivard krivard merged commit 8065c6a into main Jul 15, 2022
@krivard krivard deleted the fb-package-weight-binary-ses branch July 15, 2022 18:57