Create plan/spec for storing usage data from frontend #426

Closed · kgodey opened this issue Jul 12, 2019 · 7 comments

@kgodey (Member) commented Jul 12, 2019

We want to store CC Search usage data to measure changes in how people use CC Search over time and to measure behavioral changes in A/B test experiments.

The scope of this issue is to figure out how to implement this, what backend or third-party service to use, and how to integrate it with our A/B testing infrastructure so that we can segment users based on any A/B tests they are a part of and see how behavior changes between variants. Please note that no PII should be stored and user data should be fully anonymized.

Data to store

Search term data

For every search, store the following data correlated with the search term:

  • which results the users click on
  • how many results the user clicked on for that search
  • how long the user scrolled before clicking on any result
  • how long until the user performed a second search and what that second search is
  • the relevance score of the search (between 1 and 5 stars)

As a fictional example, we should be able to calculate that for the search term "cat":

  • images with IDs 11, 23, and 56 are the top three results that users click on
  • users click an average of 5 search results before leaving the page or trying a new search; the maximum a single user has clicked on is 23 results
  • users view an average of 20 images before they click on any search result; the maximum a single user has viewed is 400 images
  • users perform an additional search 56% of the time, within an average of 80 seconds of the first search; the shortest interval recorded is 3 seconds, the longest is 10 minutes
  • users most commonly search for "cat drawing" after they search "cat"
  • users leave the page without clicking on any results 2% of the time
  • users rate the search relevance of "cat" an average of 3.9 stars; we have 3 one-star ratings and 7 five-star ratings

Image data

For every image, store the following data correlated with the image ID:

  • Number of times that attribution buttons are clicked
  • Number of times that users have taken an image reuse survey
  • Number of times that users clicked through to the source
  • Number of times that users clicked through to the creator
  • Number of times that the image was shared on social media
@aldenstpage (Contributor) commented Jul 18, 2019

Implementation

I don't think there are any off-the-shelf tools that will fit this use case, at least not without a lot of customization. I would suggest tracking all of these traits through server-side processing (implemented in the API layer):

  • which results the users click on
  • how many results the user clicked on for that search
  • how long until the user performed a second search and what that second search is
  • how long the user scrolled before clicking on any result (can be estimated from the amount of pagination)

This one will of course need to be communicated to the backend from a front-end API call:

  • the relevance score of the search (between 1 and 5 stars)

Maybe someone who is a bit more of a Google Analytics expert (@brenoferreira or @kgodey) can tell me which of these can be feasibly tracked through GA:

  • Number of times that attribution buttons are clicked
  • Number of times that users have taken an image reuse survey
  • Number of times that users clicked through to the source
  • Number of times that users clicked through to the creator
  • Number of times that the image was shared on social media

If these can't be tracked through GA, or if we want all the data in one place, we can have the frontend call special-purpose API endpoints for tracking these events (they are client-side events and are otherwise invisible to the server).

POST https://api.creativecommons.org/analytics/
{ "event": "attribution-buttonclick", "image_id": "1234", ... }

Other events might include took-survey, to-source, shared-social-media, etc. Behind the scenes, the API will produce an alias for whatever IP called the analytics endpoint (see below).
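As a rough illustration, here is a minimal sketch of what such an endpoint could look like, assuming a small standalone Flask service; the event names are the examples above, while VALID_EVENTS and store_event are hypothetical helpers, not part of the existing API:

# Hypothetical analytics endpoint sketch; store_event is a stub, not real API code.
from flask import Flask, request, jsonify

app = Flask(__name__)

VALID_EVENTS = {"attribution-buttonclick", "took-survey", "to-source", "shared-social-media"}

def store_event(event, image_id):
    """Stub: a real service would insert into the analytics tables described below."""
    print(event, image_id)

@app.route("/analytics/", methods=["POST"])
def record_analytics_event():
    payload = request.get_json(force=True)
    event = payload.get("event")
    if event not in VALID_EVENTS:
        return jsonify({"error": "unknown event"}), 400
    store_event(event, payload.get("image_id"))
    return jsonify({"status": "ok"}), 201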

Privacy

Please note that no PII should be stored and user data should be fully anonymized.

The metrics requested under the "Search term data" section require tracking individual user search sessions (that is, I need to know that User A made a search, User B made a search, and then User A clicked on an image, as opposed to "two searches were made and then someone clicked on an image").

Storing search session data over a long period of time, even when "anonymized", can be seriously problematic, as search queries by themselves can tell you a lot about an anonymous individual. One way I would propose to counter this is to assign an IP an alias (40.77.167.xxx -> FleetingDaftGiraffe) and ensure the IP address never enters the analytics pipeline. A new alias gets assigned every 24 hours, which presumably does a better job of preserving privacy than giving an IP address a permanent alias. This will reduce the risk of us inadvertently creating long-term profiles of our users and minimize the fallout from worst-case scenarios like a breach or leak. Ideally, we would not retain search session data for very long, even if we take care not to retain the user's IP address. The data can be aggregated anonymously shortly after it has been collected.
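As a minimal sketch of how the daily-rotating alias could be derived, assuming an HMAC keyed with a secret kept outside the analytics pipeline (the wordlists and key handling here are illustrative assumptions, not a spec):

# Illustrative sketch of daily-rotating IP aliases; not an actual implementation.
import datetime
import hashlib
import hmac

ADJECTIVES = ["Fleeting", "Daft", "Quiet"]
ANIMALS = ["Giraffe", "Otter", "Heron"]

SECRET_KEY = b"rotate-me-out-of-band"  # assumed to live outside the analytics pipeline

def alias_for_ip(ip: str, day: datetime.date) -> str:
    """Map an IP to a pseudonym that changes every 24 hours.

    Only the alias enters the pipeline; the raw IP is never stored.
    """
    message = f"{ip}|{day.isoformat()}".encode()
    digest = hmac.new(SECRET_KEY, message, hashlib.sha256).digest()
    adjective = ADJECTIVES[digest[0] % len(ADJECTIVES)]
    animal = ANIMALS[digest[1] % len(ANIMALS)]
    # A short hex suffix keeps distinct IPs from colliding on the same name.
    return f"{adjective}{animal}-{digest.hex()[:6]}"

Because the date is part of the HMAC input, the same IP maps to a different alias the next day, so sessions cannot be linked across days without the key.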

Data Architecture

Every analytics event results in the server creating a record in either analytics_search_sessions (session data that should eventually be deleted automatically at a pre-defined interval) or analytics_image, both append-only* tables in the API database.

Data from analytics_search_sessions can be periodically aggregated into analytics_search_terms and analytics_users tables for indefinite, anonymous storage. We can probably perform the aggregation and the deletion of individual session data on a nightly basis.

The schema for analytics_search_terms might have fields like search_term, reporting_date, min_results_clicked, max_results_clicked, avg_results_clicked, common_followup_searches, no_results_clicked_percentage, min_followup_searches, user_rating_avg, and so on.
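As a sketch of what the nightly aggregation might look like, assuming the table and column names above (the exact schema is still to be decided, and the SQL dialect here is generic):

# Hypothetical nightly aggregation job; table and column names follow the
# proposed schema above and are not final.
import sqlite3  # stand-in for the API's real database connection

AGGREGATE_SQL = """
INSERT INTO analytics_search_terms
    (search_term, reporting_date, avg_results_clicked,
     max_results_clicked, no_results_clicked_percentage)
SELECT
    search_term,
    DATE('now'),
    AVG(results_clicked),
    MAX(results_clicked),
    100.0 * SUM(CASE WHEN results_clicked = 0 THEN 1 ELSE 0 END) / COUNT(*)
FROM analytics_search_sessions
GROUP BY search_term
"""

DELETE_SQL = "DELETE FROM analytics_search_sessions"

def run_nightly_aggregation(conn: sqlite3.Connection) -> None:
    """Roll individual sessions up into anonymous aggregates, then discard them."""
    with conn:  # commits on success, rolls back on error
        conn.execute(AGGREGATE_SQL)
        conn.execute(DELETE_SQL)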

analytics_users would track all of the other aggregate stats not tied directly to a search term.

You can probably imagine what the analytics_image table would look like without me elaborating.

* Append-only tables with occasional DELETEs of old records where appropriate
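For the sake of illustration, it might look something like this; the column names are guesses based on the per-image metrics listed in the issue, not a finalized schema:

# Guessed analytics_image schema, expressed as a Python string like the job above.
CREATE_ANALYTICS_IMAGE = """
CREATE TABLE IF NOT EXISTS analytics_image (
    image_id TEXT PRIMARY KEY,
    attribution_buttonclicks INTEGER DEFAULT 0,
    survey_responses INTEGER DEFAULT 0,
    source_clickthroughs INTEGER DEFAULT 0,
    creator_clickthroughs INTEGER DEFAULT 0,
    social_media_shares INTEGER DEFAULT 0
)
"""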

@aldenstpage (Contributor) commented Jul 18, 2019

Some suggestions after I discussed this with @kgodey:

  • It would be a good idea to have a separate server outside of the API handling analytics requests, to keep the API cleanly separated. The alternative is hiding the analytics endpoints from the docs (messy, and it also hides them from the intended users anyway). A lightweight CRUD analytics API service shouldn't be too much effort to put together, although it will take some additional hardware resources and infrastructure to manage.
  • Perhaps some of the listed client-side activity can be tracked through Google Analytics and then pulled into our database from their API.
  • Instead of server-side events, everything should be done through analytics endpoints called from the frontend; that way, we don't need to deal with IP aliasing, and can simply create a new session ID every time the user opens the site. The server can be configured not to log IPs at all. Downsides include more API calls happening from the user's browser and more complexity in the frontend implementation (a minimal sketch of this variant follows).
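As a minimal sketch of that session-scoped variant, assuming the client generates a random UUID per visit and the server never inspects the request IP (the endpoint path and names are hypothetical):

# Hypothetical session-scoped analytics endpoint; the server stores only the
# client-generated session ID and never reads or logs the request IP.
import uuid
from flask import Flask, request, jsonify

app = Flask(__name__)
EVENTS = []  # stand-in for the analytics_search_sessions table

@app.route("/analytics/search-event/", methods=["POST"])
def record_search_event():
    payload = request.get_json(force=True)
    session_id = payload.get("session_id")
    # Reject anything that is not a well-formed UUID, so arbitrary strings
    # (which could smuggle identifying data) never enter the pipeline.
    try:
        uuid.UUID(session_id)
    except (TypeError, ValueError):
        return jsonify({"error": "session_id must be a UUID"}), 400
    EVENTS.append({"session_id": session_id, "event": payload.get("event")})
    return jsonify({"status": "ok"}), 201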
@kgodey (Member, Author) commented Jul 27, 2019

Closing this since the plan has been created.

@neuroticnetworks commented Sep 15, 2019

@kgodey and @aldenstpage You've moved on, and that makes sense. I am curious, though, whether differential privacy was (or could be) considered at some point.

Do you have any thoughts about potentially using differential privacy and perhaps storing the aggregates longer term? Given the past history with GSoC, perhaps something like https://github.com/google/differential-privacy might be worth exploring?

Sorry if this is unhelpfully beating a dead horse. I feel like the privacy implications of storing these data are extremely important and deserve careful consideration.

Unrelated: it feels like DP might lead to interesting data-sharing opportunities with CC, particularly if there were a clean interface to strip a dataset of PII and attach differential privacy guarantees to the contents. That may be unrealistic or a long way off, but I wanted to throw it out there and see if you have any thoughts about it. A CC Search tool that included open, differentially private datasets would be pretty darn cool imo.

@kgodey (Member, Author) commented Sep 18, 2019

We did not consider differential privacy, but it certainly looks worth considering. I'm not familiar enough with it to have any useful thoughts about it at the moment, but if you have more thoughts to share or a proposal for us to consider implementing, please feel free to share them here or open a new issue.

@aldenstpage (Contributor) commented Sep 19, 2019

We did take some measures to protect privacy by discarding IP information and only grouping a user's searches on a session-by-session basis (e.g. if you close the tab and visit CC Search again, your subsequent searches will be attributed to a new user). In spite of these protections, it would not be proper for us to release the data publicly.

There's a high-quality paper on how to achieve differential privacy with search logs.

In this section, we introduce a search log publishing algorithm called ZEALOUS that has been independently developed by Korolova et al. [19] and us [12]. ZEALOUS ensures probabilistic differential privacy, and it follows a simple two-phase framework. In the first phase, ZEALOUS generates a histogram of items in the input search log, and then removes from the histogram the items with frequencies below a threshold. In the second phase, ZEALOUS adds noise to the histogram counts, and eliminates the items whose noisy frequencies are smaller than another threshold. The resulting histogram (referred to as the sanitized histogram) is then returned as the output.

It wouldn't be too difficult to generate ZEALOUS reports on a weekly basis and release them to the public.
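As a toy sketch of the two-phase idea described in the excerpt (the thresholds and noise scale below are illustrative, not the privacy-calibrated values from the paper):

# Toy sketch of ZEALOUS-style two-phase histogram sanitization; parameters
# are illustrative, not the calibrated values from the paper.
import math
import random
from collections import Counter

def laplace_noise(scale: float) -> float:
    """Draw one sample from a zero-mean Laplace distribution via inverse CDF."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def zealous_histogram(queries, keep_threshold=5, publish_threshold=10, noise_scale=2.0):
    """Phase 1: threshold the raw histogram. Phase 2: add noise, threshold again."""
    counts = Counter(queries)
    frequent = {q: c for q, c in counts.items() if c >= keep_threshold}
    sanitized = {}
    for q, c in frequent.items():
        noisy = c + laplace_noise(noise_scale)
        if noisy >= publish_threshold:
            sanitized[q] = round(noisy)
    return sanitized

On a log like ['cat'] * 200 + ['cat drawing'] * 12 + thirty unique rare queries, this would publish a noisy count for "cat", probably keep "cat drawing", and suppress the unique queries entirely, which is the property that protects individual searchers.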

@neuroticnetworks commented Sep 19, 2019

it would not be proper for us to release the data publicly

There are two implications to this.

  1. You become responsible for protecting these data
  2. Fine-tuning of the search algorithm has to be done "in house"; it's not as straightforward to source contributions from the community, since the community can't access the same data that you do

I don't think that the idea warrants a full-blown proposal at this point. It's a curiosity on my end, and it really comes down to this: what if you avoided tracking sensitive data altogether and instead created and then used differentially private datasets, e.g. https://openreview.net/pdf?id=rJv4XWZA- and likewise the paper that you shared?

A longer-term benefit I could see is that a pipeline for creating differentially private datasets might also lead to interesting data-sharing opportunities under CC. Likewise, there might be opportunities to strengthen the work of your eventual Grant for the Web partners through data sharing.

There are only so many engineering cycles, and I'm not sure this is worth doing. I think I'd honestly wait to see how GFW evolves and evaluate, based on the directions those projects go in, whether differential privacy would be a value add or not.
