Uber's differential privacy .. probably isn't
Today we are going to talk through a recently accepted VLDB paper, Toward Practical Differential Privacy for SQL Queries. This paper is partly what is behind Uber's SQL differential privacy project, which they have been happily plugging.
Despite what you might guess, I actually think there is a fair bit to like in the goal of the paper, specifically what is implied by its title, and some of the technical development. The world could use more people aimed at the task of providing differential privacy to people who do not have their own advanced degree in differential privacy, and the framework in this paper is a fine zero-th step.
Unfortunately, the paper has several issues. Literal issues, but also some higher-level issues. I think these issues can be fixed, but I also think they should have been fixed. It is too bad that VLDB published the paper as it is, but perhaps the issues can be addressed before the paper is presented (or perhaps the authors can float this post in their presentation, and talk some serious smack while explaining my errors).
Let me start by re-iterating that the core ideas in the paper are sound. I think they are good, even (which if you know me, I DO NOT EVER SAY). The implementation and evaluation seem to me to be a bit of a shambles, and any conclusions about the merit of the work are left in a great cloud of doubt.
The highest level issues are:
There is only one experiment the authors perform using public data and queries. I tried to reproduce the results and could not. It appears the authors (unintentionally) over-reported the quality of their results by a factor of 100x: they computed relative error and then listed it as percent error. Unfortunately, they assessed the quality of the results as if percent error and were quite pleased. Instead, three of the five queries (those reported as using joins) are less accurate than simply reporting the value "zero" as the output without looking at the data (henceforth the "just report zero" mechanism). The other two relatively accurately queries are the ones that don't take advantage of the paper's contribution.
The reproducible experiments strongly suggest that the code does not correctly handle an important case in SQL queries: semijoins. We will explain what semijoins are, but they are a restricted form of one-to-many joins. Rather than raise an exception, the system instead just introduces an inappropriately small error and reports high accuracy (the correction appears to be another factor of 80x error for one of the queries).
The results on non-public data and queries are probably just as wrong with respect to semijoins (though probably not wrong with respect to "'percent' means 'out of one hundred'"). Worryingly, the reported accuracies for the non-public queries involving joins fall in two clusters: "surprisingly good" and "fairly mediocre"; it seems likely that some fraction of the first cluster may migrate to the second cluster (though if we use the paper's calibration, at most 1.0% of them).
In any case, on all of the public queries you can get strictly more accurate results from existing differential privacy infrastructure, by using one of i. PINQ, ii. wPINQ, or iii. the just-invented "just report zero" mechanism.
I do want to re-iterate, as loudly as I can, that everything I'm going to go through was made variously possible and much easier by the authors' release of the public parts of their experiments. While there are issues, I view the whole thing as a positive for science if we make progress towards the right answer.
If you aren't familiar with differential privacy, this may not be the best "first post" in order to learn about it. I'm not 100% sure what the best first post is, or whether there is one that works for all people.
From a high level, differential privacy is a guarantee made by a data analysis program, that it will not indirectly disclose your participation or non-participation in the input. A bit more formally, the guarantee is that if you run the program on two datasets that differ only in that you withhold one record (yours, let's say) from one of the inputs, then the probability the program produces any given output changes by not much of a multiplicative factor. Informally, this gives you the ability to say, for any possible output of the program, "well, that was just as likely with my data as without".
Now, actually there is allowed some shift in probability, just not too much. Differential privacy normally comes with a privacy parameter, usually the Greek letter epsilon; we might say an algorithm has the property of "epsilon-differential privacy". An algorithm provides epsilon-differential privacy when the probability of an output changes by at most a multiplicative factor of exp(epsilon), or "e raised to the epsilon power", for each input record that is added or removed.
We won't get in to whether differential privacy is a great definition or not. It does have several properties that make it internally consistent, so whether it solves the world's privacy problems or not at least it guarantees a specific thing: your participation in the input to the program has a bounded influence on the probability of annoying things happening to you. This doesn't keep weirdos from looking in your windows, or prevent Google and Facebook from mining your social interactions for profit, so privacy isn't actually solved yet.
One relatively common way to provide differential privacy is to add noise to the result of a computation before returning it as output. This has the ability to introduce uncertainty about the actual result, and therefore uncertainty about the specific input. For example, if you just want to count the number of records in the input satisfying a property, you can provide epsilon-differential privacy by adding noise drawn from a Laplace distribution divided by epsilon (so, increasing in magnitude as epsilon gets smaller and smaller).
The reason this works is that Laplace noise hides the specific count well; if we divide the noise by epsilon, then the probability of any specific noised output changes by at most a factor of exp(epsilon) when we change the true count by one in either direction. If adding or removing your input record changes the count by at most one, this is great.
Here is a picture of the Laplace distribution applied to two different counts. Notice that the ratio of the probability densities stays bounded, and is determined by the distance between the two counts.
If adding or removing your input record changes a count by more than one, there may be a problem.
If you come from a databases background, you know what SQL is. If you don't come from a databases background, SQL is the horrible thing that you imagine people typing into massive 1980s computers, except it is now 2018.
Here is an example SQL query that comes from the repository of the post's subject:
SELECT COUNT(*) FROM orders JOIN customers ON orders.customer_id = customers.customer_id WHERE orders.product_id = 1 AND customers.address LIKE '%United States%'
This query asks for the number of records (COUNT) from the
orders dataset for which the customer has an address containing
United States as a substring.
This sort of query is problematic for differential privacy, because although we understand how to add noise to counts, we don't really understand by how much the count may change if we add or subtract an input record. For example, if we were to remove my input record, and I had placed 1,000 orders, then the count will change by 1,000. Just adding Laplace noise divided by epsilon no longer provides epsilon-differential privacy; some outputs could be as much as a factor of
exp(1,000 * epsilon) more likely with my data present than without, or vice versa, which is in either case quite a lot.
The problem here is the
JOIN part of the query, which is what takes one record in one of the inputs (a
customer) and multiplies the effect of their presence by connecting them to multiple other input records (orders), and producing multiple output records as a result. There could be many
orders with the same customer field, and
JOIN is what causes these to result in multiple output records (or multiple contributions to
COUNT) for one input
A further problem is that the number 1,000 used above was totally made up. There is no reason the number couldn't be arbitrarily larger, and in fact it would depend on the data itself. Just adding Laplace noise to the result of this query does not provide differential privacy with any value of epsilon.
This led researchers, myself included, to look into ways to support the intended functionality of
JOIN without eating the differential privacy pain. Here are some examples (biased towards my recollection; my apologies to missed work):
One of the first instances I'm aware of is from Vibhor Rastogi et al.: Relationship privacy: output perturbation for queries with joins. In this paper they consider adding noise to produced counts, but do not achieve differential privacy. Instead, they get a different (weaker, I would say) guarantee against adversaries that are not too well informed (effectively: the adversary can't know enough to piece together the many bits of evidence your record may produce).
In the Privacy Integrated Queries work, I put together an implementation of a subset of LINQ queries (a SQL-like language), including a one-to-one join operator. This mean that it worked just fine when there were unique matches, but if multiple records matched the same key, they would produce no results. In the example up above, anyone with more than one order would contribute nothing to the count (wrong!). This is mitigated somewhat by the ability to pre-group relations like
customer_id, so that we match up each customer with the group of their orders as a single record. This allows you to look at the distributions of counts of orders by customer, which is neat but not what you asked for.
A bit later on, with Davide Proserpio and Sharon Goldberg in the wPINQ project, we worked out a many-to-many join, in which the contribution of records with many matches is decreased. For the count up above, naively implemented, you would most likely get the sum over orders of one-over the number of orders with the same
customer_idfield. If you do the math, that is basically just the sum of customers with at least one order, again not what you asked for. You can do less naive things, but we aren't going to get in to them just yet.
In short, there was not a resounding and unambiguous success here.
The present work
The present work, Toward Practical Differential Privacy for SQL Queries, describes an approach that fits a bit in between approaches 2. and 3. up above, if I had to judge. Rather than require exactly one match, or support a fluid number of matches by rescaling weights, this work pre-supposes maximum cardinalities for each key, from which they derive bounds on the amount by which one record might change the total.
For example, perhaps we assume that the
orders.customer_id field takes the same value at most 100 times, meaning any one customer has at most 100 orders. If that were known to be true, then we could conclude that the addition or removal of a
customer record could change the count we are computing by at most 100. Then we could just add Laplace noise divided by epsilon and multiplied by 100, and we would have epsilon-differential privacy.
It is not appropriate to just assume the number 100, nor is it appropriate to just look at the data and see that the number is 100 and use that. It is a bit more subtle, and the paper handles this correctly (to my understanding). Prior work on smooth sensitivity shows how you can determine such a value, and if you shade it upwards enough to account for how individual records may change the value you can get differential privacy. In this case, the authors get (epsilon, delta)-approximate differential privacy, which is not quite differential privacy.
The paper has some minor issues with formalism that I think are pretty easily fixed. These are the same challenges that the PINQ paper had, and I fixed those. Let me walk through them and explain the fixes.
The privacy definition the authors use here is a deprecated form of differential privacy, which says that you must mask the difference between changes to a record. That is, the same number of records exist in the dataset, but if someone alters a record then the probability must change by at most a factor of exp(epsilon).
There are some issues with this definition.
The first issue is that such a definition allows you to release the count of the records in your dataset in the clear. Changing a record cannot change the count of the records in your dataset, so you don't need to add any noise. Given that the current paper deals only with counting queries, you really want to get the privacy requirements around counting correct.
The authors reason about how transformations to the data can change the set of records, but do not appear to reason about the change to the count of records. A
SELECT .. WHERE ..in SQL) may return fewer records than in the input relation, and it is not acceptable to release their count in the clear (e.g. "how many records have
villain.name = "FRANK MCSHERRY""). A
joinoperation may increase the number of records in a data-dependent way, and it is not appropriate to return their count in the clear.
The paper doesn't return counts in the clear, but it is a fairly big problem that their formal guarantees do not preclude it.
The fix, introduced in the PINQ paper, is to use multisets of records, rather than n-tuples of records. You should think of your dataset as a function from records to integer counts (or real-valued counts, if you use wPINQ), rather than a sequence of n records.
This change alters the privacy guarantee. Each change to a multiset is an addition or a removal. It would take two such changes to effect one change in the prior definition, so in a sense the new guarantee is weaker by a factor of two. You recover this factor of two when you look at how much noise you need to add to queries (in their paper, the "elastic sensitivity" of histogram queries can remove its factor of two).
On the positive side, you actually protect the number of records in the dataset, and now have an internally consistent formalism that can reason about filters and joins.
The paper should make this modification. All of the details are in the PINQ paper; the authors can just copy them over.
Let's talk about the experimental section of the paper. The name of their implementation is "Flex", and we'll start using that now to distinguish it from the intended behavior of their formalism, because I think there is a pretty big distinction.
There are a good number of experiments, though most relate to private Uber workloads. The authors do consider a subset of the TCP-H benchmark, five queries involving counting on which they summarize the performance of their system (with epsilon = 0.1) as:
These numbers are wrong.
The authors have very thoughtfully provided a repository containing their code:
If you head there and check out their main program, you can totally reproduce the numbers in this figure. Unfortunately, the numbers are relative error, not percent error. They are wrong by a factor of 100x.
That sounds bad, but I want to stress that a good thing happened here. The authors made their techniques publicly available, and we were able to confirm that these specific results are not correct. In my opinion that is still better than correct results whose methods are not made publicly available.
Science is about getting to the right answer, and the authors did the right thing here with respect to science.
But, the results are incorrect, and by a factor of 100x. I hope the authors choose to correct them.
I also hope that VLDB chooses to correct their program committee, because these results were obviously wrong.
If we look at query 1, for example, the median population size is 1.47 million, and the reported accuracy is 0.000025%. This represents accuracy of counts to within plus-or-minus 0.3675. Counts change by roughly one when you change a record, and in the authors' reckoning such counts can change twice (because you can increase one count and decrease another). Add on top of this the privacy guarantee of 0.1-differential privacy, which should scale noise up by a factor of ten, the counts should be randomized by at least plus-or-minus 20.0, if the authors are applying their techniques as they described. In fact the authors have another (self-inflicted) factor of 2, which leads to the actual error of 36.75.
The same sanity check disqualifies Query 4: the reported errors are less than one for a query with 0.1-differential privacy. Query 16 also fails the sanity check and is wrong by a factor of at least 100x, though probably more than that (in a moment).
Queries 13 and 21 are also off by 100x, but their values were at least plausible. We will see how to achieve the 100x improved accuracies with wPINQ and "Just report zero", respectively.
The results for Query 16 are almost certainly doubly incorrect, because the authors mis-calculate the sensitivity of the query. By "doubly" I mean "a further factor of 80".
Query 16 asks (roughly): "for each part, how many suppliers without customer complaints supply the part?" It is a bit more involved, because we only ask this for a subset of parts, and then accumulate parts by (brand, type, size). However, the
parts relation is treated as public, and the only issue lies in the determination of the above query.
In the target dataset each supplier supplies eighty parts. If you make one change to an entry of the
supplier relation you could change 80 counts. Because the counts are aggregated, you could change the total counts by 80, even if you change fewer distinct counts (you might change some counts by five, for example, or just one count by 80).
I think the problem here is that query 16 does not contain the string
JOIN. Instead, it has what is called an "antijoin". This is like a join, where we correlate keys in two relations, but we only bring data in from one relation; the second relation only exists to filter the records in the first. In the case of query 16, this is from the fragment
and ps_suppkey not in ( select s_suppkey from supplier where s_comment like '%Customer%Complaints%' )
Notice that there is no
JOIN, just a
not in, which results in the antijoin between the part supplier relation and the supplier relation.
Flex should introduce a factor of 80 because
ps_suppkey has a maximum cardinality of 80, but Flex does not do so. If you check out the schema.yaml file for the experiments, there is no upper bound on the number of parts a supplier supplies. There could in principle be one supplier who supplies all parts; nothing appears to preclude this.
With a maximum cardinality of 80, where the relative error (not percent error) is reported as 8.832263, it should actually be at least 706.58104. It may need to be more, but the number 8.8 was already worse than the "just report zero" strategy (relative error 1) so fortunately we don't have to further revise our conclusions about Flex's performance on this query.
Query 4 asks for the counts by order priority of orders with at least one delayed line item. This query involves a semijoin (like an antijoin, but retaining matching records rather than dropping them), but the authors report it as having no joins. This is worrying but, possibly by dumb luck, ignoring the semijoin still gives the right privacy bound. This is because the
lineitem.order_id field has maximum cardinality one after one applies a hypothetical
DISTINCT operator to the set of order ids in lineitems (though not before).
I don't know what the results for query 4 will look like if the authors need to correct their implementation. The results don't have to be worse, but it could be that their analysis is not sophisticated enough to arrive at the most precise answer. PINQ and wPINQ do, though.
Using prior work
Let's quickly blast out some numbers for the queries using techniques from prior work. I'm going to compare PINQ, wPINQ, and "just report zero" with the reported and corrected accuracies for each of the TPC-H queries. I'm going to use relative error, rather than percent error, so "just report zero" is simply the number 1.
The PINQ and wPINQ numbers are from just one run of the computation, rather than 100 runs as the authors do. If anyone wants to get on my case about statistical fidelity, feel free. I think there may be a queue.
|Flex (reported)||Flex (actual)||PINQ||wPINQ||JR0||Best|
To be clear, the privacy guarantees of Flex and the other approaches are different; Flex provides the 2x stronger "changes" guarantee (with unclear count privacy) and the other variants provide the addition/removal protection.
Queries 1 and 4 are basically just counts, with a one-to-one semijoin in query 4. PINQ does great here, whereas wPINQ loses a factor of two on query 4 because its one-to-one semijoin supports fluid cardinalities; static analysis of the query could reveal that it doesn't need to, but it would need specialized PINQ operators to take advantage of that. Flex has about 4x the error of PINQ, with 2x due to their different privacy guarantee and 2x due to their approach always doubling the elastic sensitivity (they don't need to do so for these queries).
Query 4 is classified by Flex as "not having a join", which is wrong. It has a semijoin, and it is actually a good example of a query where one can do well even with joins. Both PINQ and wPINQ do well. Flex's performance may degrade once they correctly identify the join, but their currently reported error is differentially private at the levels they claim.
Query 13 is an example of where wPINQ shines. The query is asking for the number of customers with different numbers of orders about which they do not complain. For example, "how many customers made 5 orders without complaints?", for each value of 5. This ends up being just a relatively simple sum in wPINQ, performed exactly like a degree-distribution CCDF measurement for graphs (see the wPINQ paper), whereas it is seems to be a more complicated and less accurate join in Flex.
The wPINQ measurement for Query 16 is a bit of joke. It isn't meant to be impressive, but to point out that there are many ways to get good numbers on silly data. You can get a median relative error of literally 0, but by asking an indirect query and only because the data is statistically dumb. The best approach should be "just report zero" if the goal is to directly measure the counts. Instead, we can ask a query that, even with lots of noise, reports that roughly 10,000 of the 18,000 results are exactly the number 4; if we just use the number 4 for all output counts, we get a median error of zero. Tada!
We can even plot our zero error for query 16 on the logscale y-axis Figure 5, because it somehow has a line for zero. We don't even have to worry about the 100x correction.
Query 21 is silly. Each output record corresponds to an input record in the
suppliers relation. There is no reason anyone should perform well on this, and "just report zero" is probably the best answer here. You could provide some more accuracy using the same tricks at we played for query 16, but we don't need to go there (the result counts are all between one and twenty; if you discover this and use nine for all answers you should get median error less than one).
The paper presents the results of a survey of a great many Uber-internal queries. To my reading, they all pass the sanity check that the TPC-H queries failed, and they probably do not require a 100x rescaling. Unfortunately, none of us know or can find out, without the help of the authors (they have been asked; I am awaiting their response).
What is less clear is how many of the queries suffer from the semijoin defect. Some queries may be misclassified as join-free, and some queries may not have enough noise applied to them to provide differential privacy. Again, we really can't know the answer to that, but worryingly even the authors observe that there are two clusters of qualitative join accuracy (the worse of which are largely above the "100% error" line). The authors really need to sort out whether any of these queries are under-reporting error by failing to properly account for semijoins.
Draw your own.
But seriously. Wander back through the paper and check which conclusions are still supported by the data. I think the techniques in the paper are worth pursuing, if massively over-sold, but what should the average non-DP researcher conclude about the work? Is it time to start using these tools on actual sensitive data? Can we use Flex to protect the assignments of VLDB reviewers to papers, cause I have some harmless counting queries I'd like issue...
There is clearly a huge gap between optimized wPINQ analyses and Flex's measurements. Normal analysts shouldn't be expected to write optimized analyses, which means there is wide open space for query optimization work. I think this is great news, and relevance-minded SQL-boosters should totally start to ponder how we can begin to close that gap.
I think this paper contains a solid first step on the path of practical differentially private counting SQL queries. There are many things that need to be fixed, but the technical details can be fixed, and the paper is interesting (to me) even if you delete the experimental results, which may be appropriate.
If nothing else there should be either a fairly substantial revision, or a fairly substantial rebuttal.
I'm interested to see what path the authors take.