Short version
Since SimHash floc IDs are just sums of vectors that correspond to individual domains, I think this version of floc could let large actors estimate traffic volume and aggregate demographics of visitors to other websites.
Related to #41 and #45, but I haven't seen this particular attack described yet.
Disclaimer: I am not a Chrome developer or a mathematician. If one of my assumptions here is off, please let me know!
Long version
The experimental version of FLoC uses SimHash, which is a deterministic mapping of browsing history -> floc ID. One of the project goals is to prevent sites/trackers from learning too much about any individual's browsing history. It should be impossible to use a single floc ID to determine with high likelihood whether a user visited a particular site. (Longitudinal privacy is a separate concern, and I'm leaving it aside here.)
But each floc ID will carry some information about the sites that are likely to make it up.
As best I can tell from here, SimHash in floc works like this:
- Each domain is hashed (deterministically) into a vector of gaussian random variables. So example.com might map to <0.51, -1.21, 0.98, ... >.
- The domains in a user's recent history are all hashed and summed up into a single vector. If the user has visited N domains, and domain d_i maps to vector x_i, then the user's full floc vector is Sum(x_1, ..., x_N).
- Finally, the summed vector is mapped to a bit vector - the floc ID. Any negative elements of the vector become 0, and positive elements become 1.
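
The three steps above can be sketched in a few lines of Python. This is only an illustration of my understanding, not Chrome's actual code: the SHA-256 seeding, the 8-dimensional vectors, and the function names `domain_vector`, `floc_vector` and `floc_id` are all assumptions I made up for the example.

```python
import hashlib
import random


def domain_vector(domain, dims=8):
    """Deterministically map a domain to `dims` Gaussian values.

    Hypothetical scheme: seed a PRNG from a SHA-256 hash of the domain.
    The real FLoC hashing is different; only the overall shape matters here.
    """
    digest = hashlib.sha256(domain.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [rng.gauss(0.0, 1.0) for _ in range(dims)]


def floc_vector(domains, dims=8):
    """Sum the per-domain vectors element by element."""
    total = [0.0] * dims
    for domain in domains:
        for i, value in enumerate(domain_vector(domain, dims)):
            total[i] += value
    return total


def floc_id(domains, dims=8):
    """Collapse the summed vector to bits: positive -> 1, otherwise 0."""
    return "".join("1" if v > 0 else "0" for v in floc_vector(domains, dims))


if __name__ == "__main__":
    history = ["example.com", "news.example", "shop.example"]
    print(floc_id(history))          # the user's ID
    print(floc_id(["example.com"]))  # the "ID" of a single site
```

Calling `floc_id` with a one-element history gives the ID a user would have if that site were the only thing they had visited.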
At a high level, each site has its own floc vector. A user's floc vector is the sum of the floc vectors of all the sites they've visited, and the floc ID is a coarser version of that. You could also say each site has its own floc ID: the ID a user would get if that site were the only one in their history.
For each bit in a user's floc ID, and for each site they visited, there is a higher-than-50% probability that the bit in their floc ID matches the bit in the site's ID. For example, if you know a user visited a site with the 4-bit floc ID 1111, without knowing what else they visited, you know each bit in their floc ID is (slightly) more likely than not to be 1. Some sites might even have dramatic floc vectors -- with several vector values more than a couple standard deviations away from 0 -- which will have a higher impact on user floc IDs.
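To put a rough number on "slightly more likely", here is a sketch under simplifying assumptions that are mine, not FLoC's: every domain's vector elements are independent standard Gaussians, and the user visited exactly N domains that all carry equal weight. In that case one site's value and the N-term sum have correlation 1/sqrt(N), and the standard result for two correlated Gaussians gives a sign-match probability of 1/2 + arcsin(1/sqrt(N))/pi. The function names are made up for the example, and the simulation is only there as a sanity check on the formula.

```python
import math
import random


def match_probability(n_domains):
    """Chance that one bit of the user's ID equals the same bit of a visited
    site's ID, assuming i.i.d. standard Gaussian elements and equal weights."""
    return 0.5 + math.asin(1.0 / math.sqrt(n_domains)) / math.pi


def simulate_match_rate(n_domains, trials=20000, seed=0):
    """Monte Carlo check of match_probability for a single vector element."""
    rng = random.Random(seed)
    matches = 0
    for _ in range(trials):
        site = rng.gauss(0.0, 1.0)
        others = sum(rng.gauss(0.0, 1.0) for _ in range(n_domains - 1))
        if (site > 0) == (site + others > 0):
            matches += 1
    return matches / trials


if __name__ == "__main__":
    for n in (1, 4, 25, 100):
        print(n, round(match_probability(n), 4), simulate_match_rate(n))
```

Under these assumptions a user with 100 domains in their history matches any given bit of a visited site's ID roughly 53% of the time, so the per-bit signal is small but not zero.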
Now suppose you're the admin of a large site, and you see millions of floc IDs per day. You want to estimate how many of your readers also visit competitor.example. You might have an idea of competitor.example's traffic from a source like Alexa, which can serve as your prior belief.
Each floc ID you observe lets you perform a Bayesian update on your prior belief about how your readership overlaps with competitor.example. Say floc ID 11011 is slightly more likely than average to contain competitor.example, while ID 01100 is slightly less likely than average. Seeing a 11011 will boost your estimate of competitor.example's traffic, and an 01100 will deflate it. Each ID carries very little information, but millions of them could give you a pretty accurate idea of a specific site's volume.
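Here is a toy version of that update, heavily simplified and not a claim about how a real attacker would do it. Assumptions of mine: only one bit of each ID is used, a visitor who also visits competitor.example matches the competitor's bit with a known probability `q` (strictly between 0.5 and 1), and everyone else matches with probability 0.5. The unknown is `p`, the fraction of your visitors who also visit the competitor, and the posterior is computed on a grid. `posterior_over_overlap` and `posterior_mean` are names I made up.

```python
import math


def posterior_over_overlap(matches, total, q, prior=None, grid_size=101):
    """Grid posterior over p, the fraction of visitors who also visit the target.

    matches: number of observed IDs whose bit equals the target site's bit
    total:   number of observed IDs
    q:       assumed match probability for a visitor of the target (0.5 < q < 1)
    prior:   optional list of grid_size prior weights (uniform if omitted)
    """
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    if prior is None:
        prior = [1.0 / grid_size] * grid_size
    log_post = []
    for p, weight in zip(grid, prior):
        # Chance that a random visitor's bit matches the target's bit.
        m = p * q + (1.0 - p) * 0.5
        log_lik = matches * math.log(m) + (total - matches) * math.log(1.0 - m)
        log_post.append(math.log(weight) + log_lik if weight > 0 else float("-inf"))
    peak = max(log_post)  # subtract the peak to avoid underflow
    weights = [math.exp(v - peak) for v in log_post]
    norm = sum(weights)
    return grid, [w / norm for w in weights]


def posterior_mean(grid, posterior):
    return sum(p * w for p, w in zip(grid, posterior))


if __name__ == "__main__":
    # 100 observations barely move a uniform prior; 10 million are far more informative.
    for total, matches in ((100, 51), (10_000_000, 5_090_000)):
        grid, post = posterior_over_overlap(matches, total, q=0.53)
        print(total, round(posterior_mean(grid, post), 3))
```

The second case uses made-up counts chosen so that the match rate is 50.9%, which is what an overlap of 30% would produce when `q` is 0.53. A real attempt would combine all bits and account for history length, which this sketch ignores.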
If this works, you could also segment your own readership to figure out cross-traffic to competitor.example among different demographics. For example, U.S. readers of your site might be twice as likely to visit your competitor as other nationalities.
This would leak information about visitorship of all sites that are included in floc calculations. You could run experiments to find out just how accurate this method would be -- maybe it's so fuzzy as to be useless, but I think it's worth looking into.
This will also be a more valuable tool for actors who observe lots of traffic in lots of different contexts. Since floc only uses information about top-level frame navigations, it will only leak information about first-party traffic. Websites that don't own ad networks will reveal information about their traffic, while actors that receive lots of third-party requests will learn more information than they expose about themselves.