Create plan/spec for storing usage data from frontend #426
We want to store CC Search usage data to measure changes in how people are using CC Search over time and to measure behavioral changes in A/B test experiments.
The scope of this issue is to figure out how to implement this, what backend or third-party service to use, and how to integrate it with our A/B testing infrastructure so that we can segment users based on any A/B tests they are a part of and see how behavior changes between variants. Please note that no PII should be stored and user data should be fully anonymized.
Data to store
Search term data
For every search, store the following data correlated with the search term:
As a fictional example, we should be able to calculate that for the search term "cat"
For every image, store the following data correlated with the image ID:
I don't think there are any off-the-shelf tools that will fit this use case, at least without a lot of customization. I would suggest tracking all of these traits through server-side processing (implemented in the API layer):
This one will of course need to be communicated to the backend from a front-end API call:
If these can't be tracked through GA or if we want all the data in one place, we can have the frontend call some special purpose API endpoints for tracking these bits (they are client side events and are invisible to the server otherwise).
Other events might include
The metrics requested under the "Search term data" section require tracking individual user search sessions (that is, I need to know User A made a search, User B made a search, and then User A clicked on an image, as opposed to "two searches were made and then someone clicked on an image").
Storing search session data over a long period of time, even when "anonymized", can be seriously problematic, as search queries by themselves can tell you a lot about an anonymous individual. One way I would propose to counter this is to assign an IP an alias (40.77.167.xxx -> FleetingDaftGiraffe) and ensure the IP address never enters the analytics pipeline. A new alias gets assigned every 24 hours, which presumably does a better job of preserving privacy than giving an IP address a permanent alias. This will reduce the risk of us inadvertently creating long-term profiles of our users and minimize the fallout from worst-case scenarios like a breach or leak. Ideally, we would not retain search session data for very long, even if we take care not to retain the user's IP address. The data can be aggregated anonymously shortly after it has been collected.
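To make the rotating-alias idea concrete, here is a minimal sketch in Python. How the alias is actually derived is not specified above; this version assumes a keyed hash (HMAC) with a random key that is thrown away whenever the day changes, so yesterday's aliases cannot be linked to today's. The `DailyAliaser` class and the tiny word lists are hypothetical and only for illustration; a real word list would need to be much larger to keep two IPs from landing on the same alias.

```python
import hashlib
import hmac
import secrets
from datetime import date

# Hypothetical, deliberately tiny word lists (collisions are likely at this size).
ADJECTIVES = ["Fleeting", "Daft", "Quiet", "Brisk", "Merry", "Pale", "Swift", "Odd"]
ANIMALS = ["Giraffe", "Otter", "Heron", "Lynx", "Mole", "Newt", "Ibis", "Yak"]


class DailyAliaser:
    """Maps an IP to a word alias that is stable for one day only."""

    def __init__(self) -> None:
        self._day = None
        self._key = b""

    def alias(self, ip: str, today: date) -> str:
        # A fresh random key each day; the old key is discarded, so
        # aliases from different days cannot be joined back together.
        if today != self._day:
            self._day = today
            self._key = secrets.token_bytes(32)
        digest = hmac.new(self._key, ip.encode("utf-8"), hashlib.sha256).digest()
        # Use three digest bytes to pick two adjectives and one animal.
        first = ADJECTIVES[digest[0] % len(ADJECTIVES)]
        second = ADJECTIVES[digest[1] % len(ADJECTIVES)]
        animal = ANIMALS[digest[2] % len(ANIMALS)]
        return first + second + animal
```

Only the returned alias would be passed on to the analytics pipeline; the IP itself is used once, inside `alias`, and never stored. Keeping the key in memory only (as here) also means a server restart rotates aliases early, which may or may not be acceptable.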
Every analytics event results in the server creating a record in
The schema for
You can probably imagine what the
* Append-only tables with occasional DELETEs of old records where appropriate
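As a rough illustration of the append-only pattern with periodic cleanup, here is a sketch using Python's built-in `sqlite3` module. The table name `search_event`, its columns, and the helper functions are all hypothetical and are not the schema proposed in this thread; SQLite is used only so the example is self-contained.

```python
import sqlite3
from datetime import datetime


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical event table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE search_event (
               id INTEGER PRIMARY KEY,
               session_alias TEXT NOT NULL,
               query TEXT NOT NULL,
               created_at TEXT NOT NULL
           )"""
    )
    return conn


def record_search(conn, session_alias: str, query: str, now: datetime) -> None:
    # Rows are only ever inserted, never updated.
    conn.execute(
        "INSERT INTO search_event (session_alias, query, created_at) VALUES (?, ?, ?)",
        (session_alias, query, now.isoformat(timespec="seconds")),
    )


def purge_older_than(conn, cutoff: datetime) -> int:
    # The occasional DELETE: drop raw events past the retention window.
    # ISO 8601 timestamps in one format compare correctly as strings.
    cur = conn.execute(
        "DELETE FROM search_event WHERE created_at < ?",
        (cutoff.isoformat(timespec="seconds"),),
    )
    return cur.rowcount
```

A scheduled job could aggregate recent rows first and then call `purge_older_than`, so raw session data only lives for the retention window.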
Some suggestions after I discussed this with @kgodey:
Do you have any thoughts about potentially using differential privacy and perhaps storing the aggregates longer term? Given the past history with GSOC, perhaps something like https://github.com/google/differential-privacy might be worth exploring?
Sorry if this is unhelpfully surfacing a dead cow. I feel like the privacy implications of storing these data are extremely important and deserve careful consideration.
Unrelated: it feels like DP might lead to interesting data sharing opportunities with CC, particularly if there was a clean interface to strip a data set of PII and attach differential privacy to the contents. That may be unrealistic / way far off but I wanted to throw it out there and see if you have any thoughts about it. A CC search tool that included open, differentially private data sets would be pretty darn cool imo.
We did not consider differential privacy but it looks like it's certainly worth considering. I'm not familiar enough with it to have any useful thoughts about it at the moment but if you had more thoughts to share or a proposal for us to consider implementing, please feel free to share here or open a new issue.
We did take some measures to protect privacy by discarding IP information and only grouping a user's searches on a session-by-session basis (e.g. if you close the tab and visit CC Search again, your subsequent searches will be seen as coming from a new user). In spite of these protections, it would not be proper for us to release the data publicly.
There's a high quality paper on how to achieve differential privacy with search logs.
It wouldn't be too difficult to generate ZEALOUS reports on a weekly basis and release them to the public.
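For a sense of what generating such a report involves, here is a rough Python sketch of the ZEALOUS steps as I understand them from the paper: cap each user's contribution at `m` distinct terms, count, drop terms below a threshold `tau`, add Laplace noise with scale `lam`, then drop terms whose noisy count is at or below a second threshold `tau_prime`. The function names are hypothetical, keeping each user's first `m` distinct terms is my own simplification, and the parameter values in any real report would have to be derived from the paper's analysis to get an actual privacy guarantee.

```python
import random
from collections import Counter


def laplace_noise(rng: random.Random, scale: float) -> float:
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace distribution with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def zealous_report(user_queries, m, tau, tau_prime, lam, rng):
    """user_queries maps a session alias to the list of terms it searched."""
    counts = Counter()
    for queries in user_queries.values():
        # Each user contributes at most m distinct terms
        # (assumption: keep the first m in the order they were searched).
        distinct = list(dict.fromkeys(queries))[:m]
        counts.update(distinct)

    report = {}
    for term, count in counts.items():
        if count < tau:
            continue  # first threshold, applied to the true count
        noisy = count + laplace_noise(rng, lam)
        if noisy > tau_prime:  # second threshold, applied to the noisy count
            report[term] = noisy
    return report
```

Only the surviving terms and their noisy counts would be published; rare queries, which are the ones most likely to identify someone, never make it into the report.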
There are two implications to this.
I don't think that the idea warrants a full-blown proposal at this point. It's a curiosity on my end and it really comes down to: what if you avoid tracking sensitive data altogether and instead create and then use differentially private datasets, e.g., https://openreview.net/pdf?id=rJv4XWZA- and likewise the article that you shared?
A longer term benefit I could see is that a pipeline to create differentially private datasets also might lead to interesting data sharing opportunities under CC. Likewise, there might be opportunities to strengthen the work of your eventual Grant for the Web partners through data sharing.
There are only so many engineering cycles and I'm not sure this is worth doing. I think I'd honestly wait to see how GFW evolves and then evaluate, based on the directions those projects go in, whether differential privacy would be a value add or not.