Differential privacy for dummies
A few weeks back I was in Tel Aviv, where the paper behind differential privacy got a "test of time" award. That ruled.
I think differential privacy is quite neat, and still has a great deal of untapped potential. While looking around to see what had happened with it while I was making laptops powerful, I was reminded of several fairly disappointing manuscripts. We had previously ignored them because they were just flat-out wrong, and we imagined that the scientific community would reach the same conclusion. It largely did.
The error in this thinking is that there is more than just the scientific community out there. There are random lay-people, as well as experts in other areas, and when they look around to see what is going on with differential privacy, they come to conclusions like this:
So there is this "great article" Fool's Gold: An Illustrated Critique of Differential Privacy. You can go take a read if you like; it is about fifty pages, although you may be disappointed to learn that there are no actual illustrations.
Instead the article, published in the Vanderbilt Journal of Entertainment and Technology Law, has a long procession of factually inaccurate statements, reflecting what appears to be the authors' fundamental lack of familiarity with probability and statistics. Or, more cynically, the article means to take advantage of a reader's lack of familiarity, because the authors work in a field where this background is absolutely mandatory.
We are going to go through the article's points, and you get to make the call on the timeless "ignorance vs malice" question.
Differential privacy formalizes the idea that a "private" computation should not reveal whether any one person participated in the input or not, much less what their data are.
Imagine the input to a randomized computation as a set of records. The formal version of differential privacy requires that the probability a computation produces any given output changes by at most a multiplicative factor when you add or remove one record from the input. The probability derives only from the randomness of the computation; all other quantities (input data, record to add or remove, given output) are taken to be the worst possible case. The largest multiplicative factor (actually, its natural logarithm) quantifies the amount of "privacy difference".
Here is the version I wrote in the PINQ paper, which I like, where the "plus circle" thing is symmetric difference: records in one set but not the other.
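In case the screenshot doesn't come through, here is my transcription of the definition (blame me, not the paper, for any typographical drift): a randomized computation M provides epsilon-differential privacy if, for all record sets A and B and all sets S of possible outputs,

```latex
\Pr[M(A) \in S] \;\le\; \Pr[M(B) \in S] \times \exp(\epsilon \cdot |A \oplus B|)
```

When A and B differ on a single record, the symmetric difference has size one and the factor is just exp(epsilon).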
Importantly, differential privacy is only a definition. It is what we use to measure the privacy properties of a randomized algorithm; it is not an algorithm itself. This is a category error we see fairly often in critiques of differential privacy.
Differentially private computations
The simplest differentially private computation counts the number of records in the dataset with some property, and reports the resulting count after we have added noise drawn from a Laplace distribution.
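If you want to poke at this yourself, here is a minimal Python sketch of that mechanism (the function name and interface are mine, not from any particular library):

```python
import random

def laplace_count(records, predicate, epsilon):
    """Count the records satisfying `predicate`, then perturb the count
    with Laplace noise of scale 1/epsilon.

    Adding or removing one record changes a count by at most one, which
    is why this noise scale suffices for epsilon-differential privacy.
    """
    true_count = sum(1 for record in records if predicate(record))
    # A Laplace(0, 1/epsilon) sample: exponentially distributed
    # magnitude with a uniformly random sign.
    noise = random.expovariate(epsilon) * random.choice([-1, 1])
    return true_count + noise
```

For example, `laplace_count(patients, lambda p: p.diagnosed, 0.1)` would report a diagnosis count typically within ten or so of the truth.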
Here is a picture of the Laplace distribution, added to two different counts: 107 and 108.
The Laplace distribution is exponentially concentrated: the probability it takes a value t units from the truth drops off exponentially with t. That makes it potentially pretty accurate, but more importantly this is why it has differential privacy properties. I took a screen shot of the proof from the PINQ paper, to try and optimize readability:
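In case the screenshot doesn't survive, the shape of the argument (my paraphrase): the Laplace density at t with center c and scale 1/epsilon is (epsilon/2) exp(-epsilon |t - c|), so for two counts c and c' differing by at most one,

```latex
\frac{\Pr[\,c + \mathrm{Lap}(1/\epsilon) = t\,]}{\Pr[\,c' + \mathrm{Lap}(1/\epsilon) = t\,]}
  = \exp\!\big(\epsilon\,(|t - c'| - |t - c|)\big)
  \le \exp\!\big(\epsilon\,|c - c'|\big)
  \le \exp(\epsilon)
```

with the middle step by the triangle inequality.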
Now, maybe you grok the proof and maybe you don't. That's fine. But, it's a proof. If you think it isn't actually correct, you are free to point out the faulty step and we can make some progress. So far, I'm only aware of one paper that claims this isn't true, and you will never guess who the authors are.
Prelude: An Introduction
Several years ago, a paper Differential Privacy for Numeric Data made some exciting claims in its abstract.
Abstract. The concept of differential privacy has received considerable attention in the literature recently. In this paper we evaluate the masking mechanism based on Laplace noise addition to satisfy differential privacy. The results of this study indicate that the Laplace based noise addition procedure does not satisfy the requirements of differential privacy.
Holy crap. If this is true, I'm probably out of a job.
Now, of course it wasn't true. A cursory reading of the paper reveals that the authors had simply mis-copied a formula from the paper they read about differential privacy, and then built their entire analysis on that error. I reached out to them:
From: Frank McSherry Sent: Wednesday, December 09, 2009 2:50 PM
I was recently pointed at your invited paper “Differential Privacy for Numeric Data”, presented recently, which concludes “that the Laplace based noise addition procedure does not satisfy the requirements of differential privacy.” I am writing to alert you, in the event that no one else has yet, to the misunderstandings of our work presented in your paper.
The main source of confusion seems to be in your reproduction, on the top of page 2, of equation 2 from [Dwork 2006]. In [Dwork 2006], available at http://research.microsoft.com/en-us/projects/databaseprivacy/dwork.pdf, equation 2 looks dissimilar from what you have reproduced. In particular, Delta f is not defined as a function of X, or of n, as it appears in your paper. In [Dwork 2006] it is a maximum taken over all possible pairs of data sets, D_1 and D_2, differing on a single record. It is not the maximum over data sets X^* differing from the true data set X in one record. The interpretation you present is known as “local sensitivity”, and is presented in the paper “Smooth Sensitivity and Sampling in Private Data Analysis”, available at http://people.csail.mit.edu/asmith/PS/stoc321-nissim.pdf. In this paper, the authors observe that local sensitivity would be a desirable calibration to use for Laplace noise, as it can often be much less than the global sensitivity from Dwork 2006, but that [as you have discovered] it does not provide differential privacy if so used. Instead, they introduce “smooth sensitivity”, which achieves many of the properties of local sensitivity while still providing differential privacy.
Most of the rest of the negative findings in your paper appear to follow from this misuse of local sensitivity in place of global sensitivity. While there are certainly challenges in applying differential privacy to continuous data of arbitrary scale, the problem is not in the guarantees of differential privacy provided by the Laplace mechanism, but in identifying a scale for which global sensitivity bounds exist. Smooth sensitivity is one way to resolve this, among many other (e.g. clamping the numerical data to a fixed range, perhaps via top-coding).
I hope that this note clarifies some of the assumedly confusing appearance of errors in the literature. Please do not hesitate to ask if you have questions about the comments above, or further questions about differential privacy generally.
But, fair cop, right? Sometimes you miss something, and you own up to it.
It turns out that owning up is for losers.
From: Muralidhar, Krishnamurty Sent: Monday, January 04, 2010 8:08 AM
Dear Mr. McSherry,
We read your comments regarding our paper. If you examine our examples closely, you will find that our results hold true for both global and local sensitivity. Since our interest is from the intruder’s perspective, the fact that the provider uses local or global sensitivity is of little consequence. It is possible that this confusion arose because of a misinterpretation of our results. In our paper, we never claim that differential privacy is not satisfied in any of these cases. In fact, we show the contrary; we show that differential privacy as defined in Dwork (2006) is satisfied in all these cases. The primary thrust of our paper is that, even if differential privacy is satisfied, an intruder will be able to gain information that allows him/her to make decisions based on the response from the system that the Laplace based noise addition approach purports to prevent. It was an oversight on our part not to note that we used local (rather than global) sensitivity in our illustration, but we stand by our results.
Please do not hesitate to contact us if you have any questions related to masking numerical data.
We wish you a very Happy New Year.
Krish Muralidhar & Rathindra Sarathy
Now, I am pretty smart, but I still couldn't resolve the apparent contradiction between the proof I was familiar with and these authors' claims. I was equally baffled by the statement:
we never claim that differential privacy is not satisfied in any of these cases
Because, like, their abstract says exactly that. If the other parts of their response aren't entirely clear to you, welcome to the club.
Anyhow, I thought I'd try and be helpful.
From: Frank McSherry Sent: Wednesday, January 06, 2010 6:51 PM
Thank you for your response. I have indeed examined your examples closely, but I have not found them to hold true for global sensitivity as well as local sensitivity. I thought I would work through the application of global sensitivity to your examples, and we could see where our understandings diverge.
The simplest example considers the summation of a collection of real numbers. To compute the global sensitivity, we must find an upper bound on the difference between the sums of any two data sets A and B differing on a single record:
GS(f) = Max |f(A) - f(B)| where A and B are arbitrary data sets differing on a single real number.
Unfortunately, for unrestricted real numbers, the global sensitivity is unbounded. A single entry can be arbitrarily large, and can cause the difference in sums to be just as arbitrarily large. The scale of the Laplace noise to add is therefore infinite, and the added noise completely obscures the true result. While this is not a very compelling release mechanism, it is not a demonstration that global sensitivity enables an intruder to learn details of the underlying records. In fact, Cynthia’s paper (and several others) contains proofs that this is not the case for global sensitivity (Theorem 4 in her paper).
One way to achieve a marginally more useful release mechanism is to place an a priori bound on the magnitude of the contribution of each record. For example, rather than "sum", we might decide to consider the function that first thresholds each number by, say, 1M, and then performs the summation. In this setting, we can conclude that the global sensitivity is bounded by 1M, as no matter the record in difference, its impact on the sum is at most 1M. Thus we perturb the true result by a Laplace random variable with parameter 1M/epsilon, and we absolutely do conclude that the probability of any outcome of the computation increases by at most a factor of exp(epsilon):
Pr[x|A] / Pr[x|B] = exp((|Sum(B) - x| - |Sum(A) - x|) * epsilon / 1M) <= exp(|Sum(A) - Sum(B)| * epsilon / 1M)
If A and B differ on a single record, |Sum(A) - Sum(B)| is at most 1M, and the above ratio of probabilities is bounded by exp(epsilon), for arbitrary adjacent data sets A and B, including the X and X* of your examples. The probability of "events", for example the response exceeding any particular value, follows from integration; as no output has its probability increased much, multiplicatively, the integral of any set of outputs also does not have its probability increased much, multiplicatively.
While +/- 1M is quite a lot of noise to introduce, if this quantity is replaced by a tighter bound on the range of elements, or enough elements are added, it becomes a less significant contribution. Other approaches have addressed the issue of accuracy for this type of computation by using smooth sensitivity (as previously described), or by considering other functions more in line with robust statistics (eg: the median, other order statistics, the inter-quartile distance). A noised version of the cumulative density function is also easy to produce, as the problem of counting the number of records above or below any threshold has much smaller global sensitivity (namely, one).
While directly adding global sensitivity based Laplace noise may not be the most accurate release technique for the data sets you are considering, it does not seem appropriate to present the shortcomings of approaches that have never been seriously proposed (local sensitivity) and assert that your negative findings generalize to approaches that have, for years, been supported by mathematical proof. The casual reader of your work may not realize that you have not faithfully reproduced existing differential privacy technology, which, your critique of local sensitivity notwithstanding, continues to provide among the very best of privacy guarantees.
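To make the clamped-sum mechanism from that email concrete, here is a Python sketch (clamping to [0, bound], as with the nonnegative quantities in the email; the names are mine):

```python
import random

def clamped_sum(values, bound, epsilon):
    """Clamp each value to [0, bound], sum, and add Laplace(bound/epsilon)
    noise. With the clamp in place, one record moves the sum by at most
    `bound`, so the global sensitivity is `bound` and the release is
    epsilon-differentially private.
    """
    total = sum(min(max(v, 0.0), bound) for v in values)
    # A Laplace(0, bound/epsilon) sample: exponential magnitude with a
    # uniformly random sign.
    noise = random.expovariate(epsilon / bound) * random.choice([-1, 1])
    return total + noise
```

With bound set to 1,000,000 this is the "+/- 1M" mechanism from the email; a tighter a priori bound means proportionally less noise.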
That's the proof we saw up above in that awesome screen shot. It's pretty short. It would be a great opportunity for the authors to point at the specific lines that they take issue with.
Or they could wait a few weeks and then double-down on being wrong.
From: Muralidhar, Krishnamurty Sent: Thursday, January 21, 2010 9:01 AM
Dear Mr. McSherry,
A careful study of your email responses thus far appear to reiterate and validate the primary conclusions of our paper that Laplace Noise Addition and differential privacy cannot be applied to numerical data in any meaningful way. Again, the thrust of our paper is NOT that differential privacy is not satisfied; it is that even when differential privacy is satisfied (making the issue of local versus global sensitivity moot), it does not prevent an intruder from gaining information about the very data points that it is intended to protect.
As a case in point, you acknowledge that differential privacy can be applied only by "clamping the numerical data to a fixed range, perhaps via top-coding" clearly indicates to us that differential privacy fails to protect the very individuals it claims to protect, namely, those individuals whose confidential values are “influential” compared to the rest of the data set. This is precisely our point in the paper; Terry Gross's claim is substantially higher than the other individuals in the database and there is no way that a differential privacy based procedure can protect her data in a meaningful way.
We urge you to give our paper a more detailed reading and would be happy to answer any questions you may have.
Rathindra Sarathy & Krish Muralidhar
Ok, this is where I start to get a bit tweaked. The authors have now made some factually inaccurate statements, and are either clueless or dishonest when they say things like
you acknowledge that differential privacy can be applied only by "clamping the numerical data to a fixed range, perhaps via top-coding"
If you look at the text up above (the quote is from the first email), you see that I wrote:
Smooth sensitivity is one way to resolve this, among many other (e.g. clamping the numerical data to a fixed range, perhaps via top-coding).
A few more examples are given in the second email. Maybe a reading comprehension fail on their part? It would be a pretty epic fail, but it would also explain a lot about their other publications. Anyhow, let's finish this off:
From: Frank McSherry Date: Thu, Jan 21, 2010 at 8:33 PM
I think the problem we have is with a definition of terms. Differential privacy is a formal mathematical definition, and as far as the scientific community understands, it aligns very well with the protection of individual data. What you have presented in your paper is an approach that does not provide differential privacy. I entirely agree that the approach you present fails to protect Terry Gross, or anyone else in (or not in) the data set. However, this is not a shortcoming of differential privacy (which is not satisfied), nor of any previously proposed approaches based on the addition of Laplacian noise (none of which appear in your note). It is a shortcoming of attempting to apply these techniques without using the prescribed sensitivity value (infinity, if you were hoping to simply add up unbounded values; less, if you apply more robust techniques).
By way of analogy, I might write a paper about the failure of modern cryptography, because I always base my encryption on a secret key value of “1”. No cryptographer thinks this is a good idea, everything falls apart, none of the theorems continue to hold, and yet it is not evidence that any aspect of modern cryptography is lacking in anything other than more forceful instructions on how it should be properly applied.
The main conclusion from your note is the observation that, if not applied as formally described in prior work, Laplace noise does not provide differential privacy, and may lead to the leakage of information. This is a somewhat more restrictive conclusion than your paper claims, however, in its abstract:
“The results of this study indicate that the Laplace based noise addition procedure does not satisfy the requirements of differential privacy”
The reader is led to believe that the definite article “the” modifying “Laplace based noise addition procedure” might refer to some Laplace based noise addition procedure that has been previously proposed (eg: based on global sensitivity, or smooth sensitivity), whereas it does not.
If your intent was a cautionary tale about misapplying the Laplace based noise techniques, I agree and apologize for the somewhat strongly-worded response, but this is certainly not the interpretation it appears to have provoked from our correspondents.
PS: Your statement that “Laplace Noise Addition and differential privacy cannot be applied to numerical data in any meaningful way” is demonstrably false. As evidence, I have thoughtfully attached the cumulative frequency function for the time between retransmitted packets in a network trace data set I was working with last night. The data are sourced in 100ns units, and the x-axis is in 1ms (= 10^6 ns) units. The values considered are as large as 640,000 in this figure, much larger than the insurance claims you were considering. Nonetheless, by using a [published] technique other than what you describe, I am indeed able to extract meaningful information from numerical data while providing differential privacy (0.2-differential privacy, in this case), through the addition of Laplace noise.
I didn't hear from them after that. Perhaps my picture actually demonstrated the errors of their ways?
The main event: Fool's Gold
Instead what happened was that the authors brought in a friend in the legal field, and they wrote a new paper. A bigger paper, full of breathless hyperbole and purple prose. This paper will be the subject of the rest of the post.
Their article starts
Legal scholars champion differential privacy as a practical solution to the competing interests in research and confidentiality, and policymakers are poised to adopt it as the gold standard for data privacy. It would be a disastrous mistake.
Differential privacy faces a hard choice. It must either recede into the ash heap of theory, or surrender its claim to uniqueness and supremacy.
I will present a third option: The authors could take a fucking stats class and stop intentionally misleading their readers.
(the following section is some text I wrote previously, edited by Cynthia Dwork and Deirdre Mulligan, and then a bit more by me):
The article opens with an example of a young internist hoping to identify a possible epidemic early by asking her city for the number of reported cases of a rare disease year by year. Being only as expert on differential privacy as the authors of that paper, the internist takes the first differentially private mechanism that occurs to her, and is distraught to find her questions badly answered.
While it is intended to be unsettling, the example shows differential privacy working exactly as promised: protecting unauthorized users from gaining overly specific information about small populations. The problems vanish if any of these three characteristics---unauthorized users, overly specific questions, and small populations---are removed. The problems of the internist lie in her expectations about how best to extract meaningful information from sensitive data, not in one of the tools she might choose to use.
While the internist is clearly a sympathetic character, we know this only because the authors assure us. If the character of the internist is replaced by a pharmaceutical representative or a law-enforcement officer, the story takes a slightly different flavor. If the internist's interests were less noble, perhaps about sexually transmitted diseases among her Facebook friends, we stop being quite so upset at differential privacy. If the internist can explain to her subjects why her queries shouldn't be subjected to privacy controls, by all means let her bypass them; until she does, she gets no special treatment.
The question posed by the internist (a count of the numbers of cases year by year) asks for more information than is needed to answer her question: are three cases abnormally high? The Laplace mechanism used in the example is the simplest of differentially private mechanisms, and the large volume of research in the area has led to many other ways of approaching data that produce more meaningful output, cost less in terms of privacy, and share the cost across multiple users. Any of these would have helped the internist get a more accurate answer.
None of this would be an issue if the population in question was not small. The internist is frustrated because her questions are about a small subpopulation: those who may have a rare disease. This is a fundamental feature of differential privacy: it protects details about small populations. This makes it very useful in cases where the small population may be arbitrarily sensitive (e.g. vocal political dissidents). In the absence of consent from the subpopulation, differential privacy takes the conservative position that it is their right to remain approximately anonymous.
The parable of the young internist tells us more about the sophistication of the paper’s authors than it does about the young internist’s potential. Should electronic medical health records be discarded because an emergency room doctor has a hard time remembering his password, putting his patients’ lives at risk? Should we reject the role of the round peg because it does not fill square holes? No. These examples, and that of the young internist, are silly cases of the misuse of very useful tools.
Differential privacy provides opportunities that wouldn't otherwise exist, but it doesn't solve all of the world's problems. This internist may need to get access to the raw data to do her job, which can be fine if other privacy controls are in place. Pretending that differential privacy should be used everywhere without question is obviously deranged: when doctors prescribe medication, should the result be differentially private? ("100ccs of raccoon, stat!") Obviously not.
On the other hand, I am very much in favor of having restrictive privacy controls like differential privacy be the default, especially for internists trained in privacy by the authors.
Part II: Stunning failures in application
This section of the article gives four examples of where differential privacy totally fails.
Part II explores the many contexts in which differential privacy cannot provide meaningful protection for privacy without sabotaging the utility of the data.
it continues with
... suggesting, at least in some cases, that the proponents of differential privacy do not themselves fully understand the theory.
I think someone doesn't understand differential privacy. Let's find out who.
Part IIa: The average Lithuanian woman
This section relates an oft-invoked example privacy concern that goes roughly as follows: Imagine we know that someone's height is two inches shorter than that of the average Lithuanian woman; would revealing that statistic violate that someone's privacy?
The example was originally constructed as part of a reductio ad absurdum argument: this "privacy concern" exists even if the someone isn't a Lithuanian woman (and so presumably isn't even in the dataset). This particular type of side information vexed a great deal of folks who wanted to provide absolute privacy guarantees: the results of a query should reveal nothing specific that could not be known without the results of the query. This example is meant to be an "oh shit" moment for absolute privacy.
Differential privacy specifically changes its goals, protecting fewer but more specific secrets. The "differential" part of the definition explains the secrets it protects: the information you contribute to or withhold from the input. If information about you could be learned without your participation in the computation's input, then the computation is not to blame for mis-managing your private information. So we think, anyhow.
Differential privacy is a way to clearly delineate between secrets that are yours to keep, and information that could be revealed by others. Without this line, you get only negative results about absolute privacy.
The authors spend a few paragraphs mystified by the example, and actually make some progress towards its conclusion: that the relative information about the heights is the problem, not accurate release of average heights.
That misunderstanding aside, the section continues with an investigation of how well Laplace noise addition would work with the average height of Lithuanian women generally, and women in the small town of Smalininkai (containing apparently 350 women). The conclusion is that the results are pretty accurate for Lithuania generally, and less accurate for Smalininkai.
Good, right? I mean, it's a much smaller population, and so we should be expected to learn much less. The authors close with this thought:
One could rationalize that smaller subgroups need more noise to protect the confidential information, but the exact same distribution of noise would also be applied to a database query system that uses a small but randomly selected (and unknown to the public) subsample of a larger population. So, if a world census allowed researchers to query average heights on a randomly selected sample of 120 Lithuanian women, the results would look just as bizarre as the ones reported in Table 2.
I can't really tell what their concern is here. My best guess is that they think that by randomly subsampling Lithuanian women, privacy is somehow well protected and you shouldn't have to add any more random noise. Given that this isn't actually good privacy, I'm glad differential privacy still protects these Lithuanians.
Anyhow, differential privacy working as intended. Clueless people don't get clear access to sensitive data about small populations, no matter how sure they are that it is safe.
Part IIb: Average of variables with long tails
The title is a bit telling about the issue with this section, which moves from heights to a quantity like salary that could have substantially larger variation. If any one person could have an epic salary, the Laplace mechanism might need to add an epic amount of noise to conceal that person (hypothetical or not) in the average.
This is totally correct. Differential privacy is very bad at this. Fortunately, averaging samples from long-tailed distributions is pretty much the domain of people who like to lie about statistics. So, it isn't too surprising to see it here.
It is generally understood that if you have values that can be arbitrarily large, the average isn't nearly as robust as the median. One individual can cause an average to shoot all over the place, either by having a large salary or just by being bad at filling out forms. Statisticians know this. If you check the Wikipedia article on US household income, you get about 5x as many hits for "median" (96) as for "mean" (17). Note that I did not apply differential privacy to those counts, because that would be dumb.
The field of Robust Statistics studies statistical estimators that are less vulnerable to non-normal distributions (not bell-curve shaped). Things like the median are very robust! The cumulative density function is also pretty good. There is even a paper connecting Differential Privacy and Robust Statistics. You would think the authors would know about these options, because that picture I sent them years back is actually of a differentially private cumulative density function, which I said to them.
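For the curious, here is a simple Python sketch of such a release (my own naive version, splitting the budget evenly across thresholds; the tree-based schemes in the literature do considerably better):

```python
import random

def noisy_cdf(values, thresholds, epsilon):
    """Release a noised cumulative count: for each threshold t, the
    number of values at most t, plus Laplace noise.

    Each threshold count has global sensitivity one; charging the
    thresholds against the budget by simple sequential composition
    gives each count noise of scale len(thresholds)/epsilon.
    """
    eps_each = epsilon / len(thresholds)
    released = []
    for t in thresholds:
        count = sum(1 for v in values if v <= t)
        # A Laplace(0, 1/eps_each) sample.
        noise = random.expovariate(eps_each) * random.choice([-1, 1])
        released.append((t, count + noise))
    return released
```

Note that no clamping or a priori range is needed: however epic any one salary is, it moves each threshold count by at most one.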
Differential privacy is indeed quite bad at determining the average of variables with long tails. At the same time, statisticians who report averages of such distributions are not very good at statistics. Statisticians who want information that does not fluctuate wildly with individual records have options:
Other approaches have addressed the issue of accuracy for this type of computation by using smooth sensitivity (as previously described), or by considering other functions more in line with robust statistics (eg: the median, other order statistics, the inter-quartile distance). A noised version of the cumulative density function is also easy to produce, as the problem of counting the number of records above or below any threshold has much smaller global sensitivity (namely, one).
That's the text from the email I sent them all those years back. The authors know this stuff exists, they just don't want to tell you about it. Instead, they write stuff like
When it comes to the analysis of continuous, skewed variables (like income), differential privacy’s strict and inflexible promises force a data producer to select from two choices: he can either obliterate the data’s utility, or he can give up on the type of privacy that differential privacy promises.
So, where are you on the whole "ignorance vs malice" spectrum at the moment?
Part IIc: Tables
To be fair to the authors, the next section is about "hey what if you count things instead", which is what a cumulative density function does, but they go with tables instead. Maybe because they really wanted to avoid illustrations.
Before we present the results, it is worth reflecting on the loss of utility that comes with the change of format. The accuracy of simple statistics from grouped histogram data is always compromised by the crudeness of the categories. Still, one might expect an improvement over the differential privacy responses for average income that we explored above.
Really? They are lamenting the loss of utility going from "averages of long-tailed variables" to something like the CDF. Statisticians, do you want to help me out here?
So, the authors discuss tabular data, staying with income for now. They write out tables containing counts for folks with various incomes, ranging from the small and reasonable up to 1 billion. In their example, the noised count of people with salaries in excess of 1 billion is five. Based on these noised measurements, they conclude:
A researcher using the responses above would conclude that the average income among Booneville residents is about $44 million.
Yes, well a stupid researcher would. However, if this researcher has taken an elementary class in probability and statistics, they understand that noisy measurements reflect information about the true data, but may not actually be the true data themselves. If you were to phone up 1,000 random people in New York and ask for their salaries, you might take the average you hear as a reasonable representative, or not. What you shouldn't do is conclude that only 1,000 people in New York are employed. You need to pay attention to how you got the data, not just slot it into your Excel spreadsheet.
Fortunately, Bayes' rule got invented a long time ago, and tells us exactly how to do this. For each measurement, you use the probability of seeing the measurement given some true data, combined with your prior belief about the true data, to produce a new "posterior" belief about the data. Your prior belief about the number of billionaires in the population may be "99.999999% none", and seeing the number 5 updates this to "99.999998% none", because the number 5 isn't very strong evidence one way or the other.
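Here is that update as a toy two-hypothesis computation in Python (the hypotheses, prior, and epsilon are all made up for illustration; a real analysis would have a prior over every possible count):

```python
import math

def posterior_prob_none(observed, prior_none, epsilon):
    """Bayes' rule for a Laplace-noised count with two hypotheses: the
    true count is 0 ("none") or 1. The likelihood of observing
    `observed` given true count c is proportional to
    exp(-epsilon * |observed - c|).
    """
    like_none = math.exp(-epsilon * abs(observed - 0.0))
    like_one = math.exp(-epsilon * abs(observed - 1.0))
    weight_none = prior_none * like_none
    weight_one = (1.0 - prior_none) * like_one
    return weight_none / (weight_none + weight_one)
```

With prior_none = 0.99999999, observed = 5, and epsilon = 0.1, the posterior barely budges: a Laplace sample landing at 5 when the noise scale is 10 is weak evidence either way.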
Concluding "golly gee; there must be five" takes a special sort of education.
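For the curious, here is a minimal sketch of that Bayesian update in Python. The numbers are made up for illustration: a privacy parameter of 0.1 (so counts get Laplace noise of scale 10), a noised count of 5, and a prior that is overwhelmingly confident there are no billionaires.

```python
import math

def laplace_pdf(x, scale):
    """Density at x of the Laplace(0, scale) noise the mechanism adds."""
    return math.exp(-abs(x) / scale) / (2 * scale)

# Illustrative parameters: eps = 0.1, so counts get noise of scale 1/eps = 10.
scale = 10.0
observed = 5.0  # the noised count of billionaires

# Prior belief: almost certainly zero billionaires, a sliver of mass on 1..10.
prior = {0: 0.99999999}
for k in range(1, 11):
    prior[k] = (1 - prior[0]) / 10

# Bayes' rule: posterior(k) is proportional to prior(k) * P(observed | true = k).
unnorm = {k: p * laplace_pdf(observed - k, scale) for k, p in prior.items()}
total = sum(unnorm.values())
posterior = {k: v / total for k, v in unnorm.items()}

print(f"posterior probability of zero billionaires: {posterior[0]:.8f}")
```

Seeing a 5 barely moves the prior, because under noise of scale 10 a 5 is nearly as likely when the true count is zero as when it is five.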
So, check with the statisticians working with your data. Ask them if they've heard about Bayes' rule (they have). If they haven't heard about Bayes' rule they probably aren't statisticians; they may be hack privacy researchers instead.
Part IId: Correlations
The problems of this next section actually start with the last paragraph in the previous section. There the authors observe that, like count queries, other queries whose value changes by at most 1 when a record is added or removed can be protected by the Laplace mechanism, and they produce horrible results with it. They give the example of determining the average rate of income tax (presumably from 0.0 to 1.0; I'm not a tax lawyer), and observe that if you
- compute the average tax rate, and then
- apply enough Laplace noise to mask a contribution of 1,
you get abysmal results. It's true.
The subject of this section is correlations, measurements of dependence between variables, ranging from -1 to 1. A value of 1 indicates that the two variables are perfectly correlated, and a value of -1 indicates that they are perfectly anti-correlated. The authors work through the only technique they know: compute the correlation and then add enough noise to mask a contribution of 1. It does very badly indeed. What should we learn from this?
This is a horrible way to compute such an average. You went and added the noise you would add to large sums over many people to a single [-1.0, 1.0] number. What did you think would happen?
Fortunately, there are smarter ways to compute averages than first taking the average and then applying noise. And, of course, the authors know this. Earlier in their article, when determining Lithuanian heights in Section IIa, the authors reveal that they do understand how to compute averages: sum the numerator and add noise, sum the denominator and add noise, then divide one by the other.
Do not first sum and divide, and then add noise. This entire section is predicated on the idea that you do this badly.
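To see the gap concretely, here is a small sketch in Python, with made-up data and an illustrative privacy parameter, contrasting "average then noise" with the noisy-sum over noisy-count approach the authors themselves used in Section IIa.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 10,000 tax rates, each in [0.0, 1.0].
rates = rng.uniform(0.1, 0.4, size=10_000)
eps = 0.1  # illustrative privacy parameter

# Bad: compute the average, then add noise large enough to mask a whole
# record (sensitivity treated as 1, so Laplace noise of scale 1/eps = 10
# swamps a value that lives in [0, 1]).
bad = rates.mean() + rng.laplace(scale=1.0 / eps)

# Better: noise the sum and the count separately (each changes by at most 1
# when a record is added or removed), spending eps/2 on each, then divide.
noisy_sum = rates.sum() + rng.laplace(scale=2.0 / eps)
noisy_count = len(rates) + rng.laplace(scale=2.0 / eps)
good = noisy_sum / noisy_count

print(f"true mean {rates.mean():.4f}, noise-the-average {bad:.4f}, "
      f"noisy-sum / noisy-count {good:.4f}")
```

The noise on the sum and count is the same scale as before, but it is now dwarfed by totals in the thousands rather than dumped onto a number between 0 and 1.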
What the authors make clear here is that there is a difference between a good differentially private algorithm and a bad differentially private algorithm. The authors have invented a bad one, and want you to believe it is the only one.
With noise like this, differential privacy simply cannot provide a workable solution for analyses of correlations or any statistical measure that has a strict upper and lower bound.
This is true; "noise like that" sucks. But you are only using "noise like that" because you choose to remain willfully ignorant of other options that work much better. Options you used just a few subsections earlier...
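One of those better options, sketched here with synthetic data: rather than noising the correlation itself, release noisy versions of its sufficient statistics (assuming each variable is clipped to [-1, 1], so every sum below changes by at most 1 per record) and assemble the correlation from those. The even budget split and parameters are illustrative, not a tuned mechanism.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic, clipped data: both variables lie in [-1, 1], which bounds each
# record's contribution to every sum below by 1.
n = 50_000
x = rng.uniform(-1, 1, n)
y = 0.8 * x + 0.2 * rng.uniform(-1, 1, n)

eps = 1.0  # illustrative total budget, split evenly over six releases

def noisy(total):
    # Each sum has sensitivity at most 1, and we release six of them,
    # so each gets Laplace noise of scale 6/eps (a crude, even split).
    return total + rng.laplace(scale=6.0 / eps)

# Release noisy sufficient statistics instead of noising the correlation.
N = noisy(float(n))
S_x, S_y = noisy(x.sum()), noisy(y.sum())
S_xx, S_yy, S_xy = noisy((x * x).sum()), noisy((y * y).sum()), noisy((x * y).sum())

# Assemble the Pearson correlation from the noisy sums.
cov = S_xy / N - (S_x / N) * (S_y / N)
var_x = S_xx / N - (S_x / N) ** 2
var_y = S_yy / N - (S_y / N) ** 2
r = cov / np.sqrt(var_x * var_y)

print(f"true correlation {np.corrcoef(x, y)[0, 1]:.3f}, private estimate {r:.3f}")
```

Noise of scale 6 is negligible against sums over fifty thousand records, so the assembled correlation lands close to the truth; with small populations it would degrade gracefully, which is the point of the definition.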
Part II: Summary
There are four complaints in this section; let's review them:
Differential privacy provides inaccurate answers for small populations.
Check. That is its goal, and we do a great job. Thank you for noticing!
Laplace noise addition adds way too much noise to statistics that can fluctuate wildly with few users.
Also true. But for exactly those fluctuation reasons, statisticians use better statistical estimators. You should consider using them. Differential privacy seems rather compatible with the improved estimators. You are welcome!
People may think that noised counts are true counts, rather than noised counts.
I suppose. The way to get around this is education. We've written several papers connecting probabilistic inference with differential privacy. You could read them and then help inform people, rather than cynically tell readers that there is no choice but to do things badly.
Computing correlations and such, just by naively using the Laplace mechanism, works poorly.
Yes, this is why we didn't stop doing research once we wrote about the Laplace mechanism, even though that very paper contains the mechanisms you should have used for correlation. It turns out that there are lots of other options, which can be much better. If you look into them for just a few brief moments, or read your emails from a few years ago, you will discover some. Other options include: asking an expert, rather than pretending to be an expert.
The authors seem to be some mix of willfully ignorant, technically deficient, and intentionally misleading. I can't really tell what mix.
Part III: Blah blah
The article continues with a deconstruction of what went wrong with differential privacy. I read it, and it isn't really worth picking apart. It has a large number of complaints that are the same flavor as above, with the same flavor of response:
- Differential privacy protects small populations, no matter how desperately you wish it didn't.
- Differentially private techniques report noised values, so bear that in mind when you use them.
- Differential privacy is a definition, and repeatedly picking stupid mechanisms doesn't mean you should be excused from it.
The section is worth a read for their takedowns of quite famous and well-regarded folks, using the findings from the previous sections as ammunition. I just don't see how it ends well for the authors.
Part IV: Conclusions
The conclusions mostly alternate florid prose with a few non-crucial misunderstandings about how differential privacy should be positioned. The prose is funnier, so let's copy that:
Differential privacy faces a hard choice. It must either recede into the ash heap of theory, or surrender its claim to uniqueness and supremacy.
In its pure form, differential privacy has no chance of broad application.
In its strictest form, differential privacy is a farce. In its most relaxed form, it’s no different, and no better, than other methods.
Adopting differential privacy as a regulatory best practice or mandate would be the end of research as we know it.
Actually, I really like this last one. "Research as we know it" at the moment seems to allow any random idiot to demand access to data, independent of the privacy cost to the participant. Instead, differential privacy, and techniques like it, align accuracy with privacy concerns, and challenge researchers to frame their questions in a way that best respects privacy. I'm ok with that change.
Here is the article's last line:
Lest we ... blah blah ... the legal and policy community must curb its enthusiasm for this trendy theory.
You could curb your enthusiasm. Or you could educate yourself. This article, and in particular its authors' level of scholarship, is a travesty.
Perhaps, independent of your ultimate take on differential privacy, the legal and policy community could curb its enthusiasm for incompetent and/or intentionally dishonest research articles. Vanderbilt Journal of Entertainment and Technology Law, I'm looking at you.