DEP 8: Gathering Django usage analytics #31

Closed

@jacobian
Member

jacobian commented Nov 5, 2016

I want to start collecting some basic usage metrics so that it's easier for the DSF to raise money.

TODO before merging to master:

  • reference implementation
  • list prior art
  • clarify the cid bit and why we couldn't track a user even if we knew their id
  • other things I'm probably forgetting
Google Analytics vs other platforms/choices
-------------------------------------------
Using Google Analytics is a trade-off. On the one hand, Google's track record

@ericholscher

ericholscher Nov 5, 2016

Might I suggest running a proxy that sends this data along to GA? That way you can change to an API compatible endpoint in the future, without breaking deployed code. It would require running a proxy on your infra, but that is much less demanding than a full analytics install.

@jacobian

jacobian Nov 6, 2016

Member

It's something I considered, and then discarded under the same reasoning as not running our own choice: I don't want to increase maintenance burden. That said, there are some good reasons to think about a proxy: the one you mentioned, as well as that it'll let us strip out the IP address which addresses the single remaining GA privacy concern. So might be worth thinking further here.
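As a rough illustration of the filtering such a proxy could do, the sketch below drops every field that is not on an explicit allowlist before a hit is forwarded. The function name and the allowlist are assumptions made for illustration; only `cid`, `aid` and `av` are taken from the DEP text quoted in this thread. Because the proxy makes the upstream request itself, the analytics service would see the proxy's address rather than the developer's.

```python
# Hypothetical allowlist: "cid", "aid" and "av" are named in the DEP text quoted
# in this thread; everything else about this filter is an assumption.
ALLOWED_FIELDS = frozenset({"cid", "aid", "av"})


def sanitize_hit(fields):
    """Return a copy of `fields` that contains only allowlisted keys."""
    return {key: value for key, value in fields.items() if key in ALLOWED_FIELDS}


if __name__ == "__main__":
    # "client_ip" is a made-up field standing in for anything we do not want to forward.
    incoming = {"cid": "example-id", "av": "1.10", "client_ip": "203.0.113.7"}
    print(sanitize_hit(incoming))
```

An allowlist (rather than a blocklist) means a new, unexpected field is dropped by default instead of being passed through.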

@raphaelm

raphaelm Nov 6, 2016

I'd be much in favor of a proxy of that kind.

@Alir3z4

Alir3z4 Dec 11, 2016

The proxy part is really interesting.
It helps in the future.

On the other hand, it increases the maintenance burden as well.

@ericholscher

ericholscher Nov 5, 2016

This Beacon implementation in Sentry is one we've been thinking about adding to Read the Docs, for similar reasons: https://github.com/getsentry/sentry/blob/bfc711ed2579d8588f99170c75d974af3d4c8e96/src/sentry/tasks/beacon.py#L32 -- it's a bit of a different idea, but it's good prior art.

Of note, it also allows sending a response that includes a message -- which could be useful for security notices. This is probably out of scope for the Django implementation, but might be another added user benefit of "phoning home" in dev.

@alexwlchan

A few very minor spelling/grammar suggestions, but otherwise this seems like a pretty sensible proposal. 👍

much easier to approach organizations for funding. As Eghbal writes:
[W]ithout data about which tools are used, and how much we rely upon them,
[it is hard to paint a clear picture of what is underfunded.

@alexwlchan

alexwlchan Nov 5, 2016

Contributor

Extra square bracket has crept in.

Analytics will be sent when certain ``django-admin`` commands are run:
``startproject``, ``startapp``, and ``runserver``. If a settings file
can be loaded (i.e. for``startapp`` and ``runserver``), analytics will only

@alexwlchan

alexwlchan Nov 5, 2016

Contributor

Missing space between for and startapp.

How will analytics be sent?
---------------------------
Data is sent to Google Analytics over HTTPs using Python's ``urllib2`` standard

@alexwlchan

alexwlchan Nov 5, 2016

Contributor

Minor: should this be "HTTPS"?

@dstufft

dstufft Nov 5, 2016

Member

I mentioned this to @jacobian in IRC but I figured I'd mention it here as well.

While it skews somewhat towards "number of downloads (not installs, downloads)", if there's something that can be added to pip or PyPI to aid in this goal, I'm definitely interested in it. Of course we have the same privacy goals there as well, but if there are things that can be added on that front to help Django (and other projects), we can absolutely make something happen there. I've wanted to do it for a while and I've just lacked time.

@shaib

shaib Nov 5, 2016

Member

@dstufft Every time you install Debian, it asks you if you want to participate in a "popularity contest" which reports home which packages are installed. I guess we could add something like that to virtualenv. The package in Debian which takes care of this is called popcon.

the Google Analytics application name (`aid`_) and application version (`av`_).
- A unique Django analytics user ID, e.g. ``3fa04034-a36b-11e6-acd6-acbc32c6febd``.
This is generated by the Python standard library function ``uuid.uuid1()`` and

@aaugustin

aaugustin Nov 5, 2016

Member

You could use uuid4 instead to further minimize privacy concerns.
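To make the privacy difference concrete: a version-1 UUID encodes a timestamp and a 48-bit node field (usually derived from a network card's hardware address), whereas a version-4 UUID is random. The sketch below decodes the example ID quoted from the DEP using only the standard library; the helper names are illustrative, not part of any proposal.

```python
import datetime
import uuid

# Offset between the UUID epoch (1582-10-15) and the Unix epoch, in 100 ns ticks.
UUID_EPOCH_OFFSET = 0x01B21DD213814000


def describe(identifier):
    """Return (version, node-as-hex) for a UUID string."""
    parsed = uuid.UUID(identifier)
    return parsed.version, "%012x" % parsed.node


def uuid1_created_at(identifier):
    """Recover the creation time embedded in a version-1 UUID (UTC)."""
    parsed = uuid.UUID(identifier)
    microseconds = (parsed.time - UUID_EPOCH_OFFSET) // 10
    epoch = datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc)
    return epoch + datetime.timedelta(microseconds=microseconds)


def new_anonymous_id():
    """Generate a random (version-4) identifier with no embedded host or time data."""
    return str(uuid.uuid4())


if __name__ == "__main__":
    example = "3fa04034-a36b-11e6-acd6-acbc32c6febd"  # example ID from the DEP text
    print(describe(example))
    print(uuid1_created_at(example))
    print(new_anonymous_id())
```

Anyone holding a version-1 ID can read back the node field and creation time this way, which is the concern that switching to `uuid.uuid4()` would remove.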

Who has access to analytics data?
---------------------------------
Access to the Google Analytics dashboard and data will be limited to the

@arikfr

arikfr Nov 5, 2016

Is it limited on purpose or because of the tool (GA)?

By design the data isn't supposed to contain any private data, so is there a reason not to share it with the public, assuming it's possible?

@jacobian

jacobian Nov 6, 2016

Member

Yes: it's a "defense in depth" sort of thing. Privacy is a real concern here, and I want to put as many controls around respecting privacy as possible. If we've somehow got something wrong and there's a way to de-anonymize the raw data, I want to limit the number of people who might be able to do so. Sharing summaries and reports is totally something I hope we'll do, but keeping access to the raw data as restricted as is reasonable seems like a good idea.

@shaib

shaib Nov 6, 2016

Member

I think these can work much better if we don't try to do this on our own, but in coordination with other important open-source forces. I would love to find out what Debian and Fedora think of this; if they object, I suspect it could turn out ugly.

For the record, Debian's popcon is opt-in, but the installer makes sure to present the option. I'd feel much better about this if we can do something similar.

@raphaelm

raphaelm Nov 6, 2016

@aaugustin I'm at the sprints today, but I'm not the package maintainer for Debian; I'm not even a Debian developer. You probably confused me with @rhertzog, who is listed on the Debian page together with @lfaraone and @brianmay.

However, I know that Debian has in the past patched this kind of thing out of packages; I just can't remember off the top of my head which packages those were.

@aaugustin

aaugustin Nov 6, 2016

Member

If major distros stripped the analytics, it would be a shame. The packager for Debian, Raphael Michel (@raphaelm), was at the DUTH sprints yesterday. Jacob, if you're still there, perhaps you can talk to him?

The order of magnitude would likely remain correct, though, due to virtualenv and pip being the dominant installation method.

@shaib

shaib Nov 6, 2016

Member

@aaugustin the technical differences might be minor, but the major distros' positions would be very important for public reception.

@shaib

shaib Nov 6, 2016

Member

Another thought: if we're going there, we should probably be collecting usage of 3rd-parties -- at first I thought "apps", but perhaps more than that.
For apps, we could add a flag in AppConfig, defaulting to "no", which says "I want this app reported in the Django PopCon".
For other packages, this would probably require some setuptools API; I'm imagining some way for a "main" to ask its environment for a list of packages who want to be reported, and a way for a package to specify if it wants to be reported. Django could then use this API to do a real popcon. And I can easily see others using the same -- IPython & Anaconda are the first to come to mind.
@dstufft , thoughts?
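A minimal sketch of the per-app flag idea, assuming a hypothetical `report_in_popcon` attribute that defaults to "no" when an app config does not define it. No such attribute exists in Django today. The function accepts any iterable of objects with a `name` attribute, so inside a project it could be fed `django.apps.apps.get_app_configs()`; stand-in objects are used here so the example runs without Django installed.

```python
from types import SimpleNamespace


def reportable_apps(app_configs):
    """Return the sorted names of apps that explicitly opted in to reporting.

    `report_in_popcon` is a hypothetical attribute; a missing attribute means "no".
    """
    return sorted(
        config.name
        for config in app_configs
        if getattr(config, "report_in_popcon", False)
    )


if __name__ == "__main__":
    # Stand-ins for AppConfig instances; the app names are made up.
    configs = [
        SimpleNamespace(name="thirdparty_forms", report_in_popcon=True),
        SimpleNamespace(name="internal_billing"),  # no flag set: never reported
        SimpleNamespace(name="thirdparty_tags", report_in_popcon=False),
    ]
    print(reportable_apps(configs))
```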

@aaugustin

aaugustin Nov 6, 2016

Member

@raphaelm: sorry, I was thinking of Raphael Hertzog. Mea culpa!

they couldn't track Django users. As far as we can tell, the only thing Google
could do would be to lie about anonymizing IP addresses, and attempt to match
users based on their IPs. If we discovered Google was lying about this,
we'd obviously stop using them immediately.

@rbarrois

rbarrois Nov 6, 2016

Unless passing the stats through a DSF-operated proxy, how would you "stop using [Google] immediately"?
By that time, there would be thousands of installed Django projects still sending analytics to Google.

@rbarrois

rbarrois Nov 6, 2016

Could you expand on the rules for sending events w.r.t project × developer combinations?

For a given team project, when should events be sent:

  • Once for the project, no matter how many developers
  • Once per developer on the team
  • As many times as developers start the runserver command?

The last one might raise more privacy issues — it is likely run quite often throughout the development cycle, whereas startapp / startproject are run much less often.

as much as we want this data, collecting it is simply too invasive.
Another option would be to collect data on the admin usage (e.g. embedding
Google Analytics directly). A couple of Django projects (Wagtail and Oscar)

@tomdyson

tomdyson Nov 6, 2016

Wagtail doesn't embed Google Analytics, and although you haven't said this exactly, some readers could mistakenly conclude that it does. Would you mind removing the "(e.g. embedding Google Analytics directly)" clause to clarify this?

In case it's useful, Wagtail's approach is as follows:

  • for 'administrator' users (i.e. not standard editors) a non-blocking, client-side request is made to a text file on https://releases.wagtail.io, which is a CloudFront distribution
  • the text file contains the details of the latest stable version. If the current version is lower than the latest stable version, the administrator is alerted to the possibility of an upgrade
  • a script runs periodically to parse the CloudFront logs and report previously unseen domains (the referrer of the request)
  • it's opt-out. Site implementers can disable this behaviour with a documented settings config

We decided against the Google Analytics approach because we thought users would be nervous about enabling the transmission of server-to-server information, and because we only wanted to record the minimum useful usage data.

@thomasgoirand

thomasgoirand Nov 7, 2016

Hi,
Debian will, for sure, consider this a privacy breach and remove the code. If this isn't done, I'll file a bug, as for me, privacy is an important aspect of Debian. I would strongly recommend against hard-wiring this type of code without a fail-safe (i.e. some easy way to disable it), and preferably, have it disabled by default.

As for my personal opinion about this, I'm also strongly against providing any information to Google. Not only is this a privacy breach, it is also potentially a security problem (i.e. a query to Google may reveal which version of Django is running on the disclosed IP address, which is useful to attackers).

Instead of this, why don't you just do a survey, and advertise it in the manage.py command? Any form of advertising for such a survey should be fine, provided that it doesn't breach privacy.

@jezdez

I've added some inline comments to specific parts of this proposal.

In short, I'm in favor of collecting usage metrics (and we should use the term "metrics", not "analytics" for terminology and trademark reasons), but need to be way more defensive to protect the privacy of our users. As such I propose:

  • either use a proxy to never hit Google Analytics directly and to filter out data that we don't want at all
  • or build the analysis tooling ourselves (e.g. using Re:dash or Apache Airflow)
  • disabled by default (opt-in), with a smarter prompt to enable it when needed
  • legal review and drafting of privacy statements to cover for the transfer of data to a 3rd party under US jurisdiction
  • adopt something like Mozilla's Data Privacy Principles: https://www.mozilla.org/en-US/privacy/principles/
  • make "Datenvermeidung und Datensparsamkeit" (principles of data reduction and data economy) a topic every Django team member understands and practices
@@ -0,0 +1,306 @@
=======================================
DEP 8: Gathering Django usage analytics

@jezdez

jezdez Nov 7, 2016

Contributor

s/analytics/metrics/g

Specification
=============
Starting in version XXX, Django gathers anonymous user analytics and report

@jezdez

jezdez Nov 7, 2016

Contributor

s/anonymous user analytics/anonymous usage metrics/g

---------------------------
Data is sent to Google Analytics over HTTPs using Python's ``urllib2`` standard
library.

@jezdez

jezdez Nov 7, 2016

Contributor

This should only happen via urllib2 on Python >= 2.7.9 or any other version that has the backported TLS cert validation feature. This should also say that the data is sent encrypted to Google Analytics.

@aaugustin

aaugustin Nov 7, 2016

Member

I don't expect that feature to land before Django drops support for Python 2 and I don't think it'll be backported to earlier releases.

@patrakov

patrakov Nov 7, 2016

This should not happen in China at all (Google is blocked, users will see ugly timeouts)
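One way to keep a blocked or slow endpoint from hurting developers is to send with a short timeout and swallow every network error. The sketch below is an assumption about how that could look, not the DEP's implementation: it uses Python 3's `urllib.request` (the thread discusses Python 2's `urllib2`), and the function name, default timeout and injectable `urlopen` parameter are all illustrative.

```python
import urllib.parse
import urllib.request


def send_metrics(url, payload, timeout=1.0, urlopen=urllib.request.urlopen):
    """POST `payload` as form data; return True on success, False on any network failure.

    `urlopen` is injectable so the behaviour can be exercised without network access.
    """
    data = urllib.parse.urlencode(payload).encode("ascii")
    try:
        with urlopen(url, data=data, timeout=timeout):
            return True
    except OSError:
        # URLError, connection errors and socket timeouts are all OSError subclasses.
        # A metrics failure must never surface to the developer.
        return False


if __name__ == "__main__":
    def blocked(url, data=None, timeout=None):
        # Simulates an endpoint that is unreachable from the user's network.
        raise TimeoutError("timed out")

    print(send_metrics("https://metrics.example.invalid/collect", {"av": "1.10"}, urlopen=blocked))
```

Even with a short timeout the call blocks the command for that long, so a real implementation would probably also want to send from a background thread or process.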

Access to the Google Analytics dashboard and data will be limited to the
following people/groups:
- The DSF President, in their role providing oversight to the DSF.

@jezdez

jezdez Nov 7, 2016

Contributor

Not the board in general?

@evildmp

evildmp Dec 5, 2016

I would suggest adding:

Members of the Django Software Foundation Board, upon application.
- Members of the Django Technical Board, upon request.
- Members of the Django Infrastructure Team (so they can maintain the GA

@jezdez

jezdez Nov 7, 2016

Contributor

s/Django Infrastructure Team/Django Ops team/g

Users can disable analytics collection in two ways:
1. By setting an environment variable: ``export DJANGO_NO_ANALYTICS=1``.

@jezdez

jezdez Nov 7, 2016

Contributor

Double negative, let's use DJANGO_USAGE_METRICS=0

@aaugustin

aaugustin Nov 7, 2016

Member

While I agree the double negative is annoying, I think it'd be easier if the API was "set the env var to anything not empty to change the behavior". What about DJANGO_DISABLE_METRICS=...?

@jezdez

jezdez Nov 7, 2016

Contributor

@aaugustin The problem with that technique (seen this with pip's env var support) is when people set DJANGO_DISABLE_METRICS=no and expect it to work.
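The ambiguity described here can be avoided by parsing the value explicitly instead of testing for "set and non-empty". The sketch below assumes the `DJANGO_USAGE_METRICS` name suggested earlier in this thread; the accepted spellings, the default and the fail-closed handling of unknown values are assumptions, not anything the DEP specifies.

```python
import os

TRUE_VALUES = frozenset({"1", "true", "yes", "on"})
FALSE_VALUES = frozenset({"0", "false", "no", "off"})


def metrics_enabled(environ=None, default=False):
    """Interpret DJANGO_USAGE_METRICS; unknown values disable metrics (fail closed)."""
    if environ is None:
        environ = os.environ
    raw = environ.get("DJANGO_USAGE_METRICS")
    if raw is None or not raw.strip():
        return default
    value = raw.strip().lower()
    if value in TRUE_VALUES:
        return True
    if value in FALSE_VALUES:
        return False
    # A typo such as "flase" should never turn collection on by accident.
    return False


if __name__ == "__main__":
    for candidate in ("0", "no", "yes", "flase"):
        print(candidate, metrics_enabled({"DJANGO_USAGE_METRICS": candidate}))
```

With this approach `no`, `0` and `false` all mean the same thing, which sidesteps the surprise of a "disable" variable set to `no`.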

@jarshwah

jarshwah Dec 3, 2016

Member

Personally I'd like to see a "de facto" standard emerge for disabling tracking across a number of tools that have begun adding analytics. It'd be far easier for privacy-minded people to set a single env var, and know that a number of tools will honour it, without having to set a new one for each tool. I'm not opposed to a Django-specific one, especially if setting it to a specific value enabled tracking, but I'd be in favour of a general fallback key.

ANALYTICS_DISABLED="whatever" # force no collection, hopefully other tools will begin honouring it too. All values will disable.
DJANGO_NO_ANALYTICS= # 0 enables analytics, every other value disables.
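The two-level lookup described in this comment could be sketched as follows. The variable names are written here as `ANALYTICS_DISABLED` and `DJANGO_NO_ANALYTICS`; neither is an established convention. Treating an empty value as "unset" and defaulting to enabled (the opt-out position argued for in this comment) are assumptions of the sketch.

```python
import os


def analytics_allowed(environ=None, default=True):
    """Apply the general kill switch first, then the Django-specific variable."""
    if environ is None:
        environ = os.environ
    # General fallback key: any non-empty value disables collection.
    if environ.get("ANALYTICS_DISABLED"):
        return False
    specific = environ.get("DJANGO_NO_ANALYTICS")
    if not specific:
        # Unset or empty: fall back to the project default (assumed opt-out here).
        return default
    # "0" enables analytics; every other value disables it.
    return specific == "0"


if __name__ == "__main__":
    print(analytics_allowed({"ANALYTICS_DISABLED": "whatever", "DJANGO_NO_ANALYTICS": "0"}))
    print(analytics_allowed({"DJANGO_NO_ANALYTICS": "0"}))
    print(analytics_allowed({"DJANGO_NO_ANALYTICS": "1"}))
```

Checking the general key first means a user who has opted out everywhere cannot be opted back in by a tool-specific setting.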
-- which raise some of the same concerns as above. So, we choose to only measure
upon certain commands that we can feel fairly certain won't be run in production.
This runs the risk of undercounting, but we think this is the best option.

@jezdez

jezdez Nov 7, 2016

Contributor

What about commands such as collectstatic that may or may not be used in development but could provide useful information about the usefulness of Django features, especially when deciding the fate of contrib apps?

We believe that collecting data by default is the only way we'll get a roughly
accurate measure of Django's usage.

@jezdez

jezdez Nov 7, 2016

Contributor

I'm strongly opposed to enabling this by default, even "just for developers".

Doing so would jeopardize an important step in building trust with new Django users, and would risk destroying existing trust with long-time Django users when they are informed of the data collection on the first call of a management command.

While this DEP describes an option to opt out, it's a high-enough barrier that only privacy-minded users will take those extra steps to actually disable the collection. Since our goal as an Open Source project should be not to sacrifice our users' essential right to privacy (philosophically speaking, not politically), we need to let them decide if (and maybe when) to enable data collection, however anonymized it may be.

Instead, what I would suggest is to prompt the user after a few calls to the previously described management commands (and assuming --noinput is not used in those calls) whether to enable data collection or not. That prompt should have a short overview of what data is collected and sent to Google Analytics as an example, and a link to more elaborate documentation similar to https://www.mozilla.org/en-US/privacy/firefox/.

The prompt should only show up after a few calls to the management commands (e.g. 5) to avoid spoiling "the first impression" of new users -- nothing is more of a downer than being asked to answer prompts when you're deep in a tutorial. The tutorial documentation should be amended to mention the possibility of this prompt.
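The delayed prompt could work roughly like the sketch below. All names are hypothetical, the threshold of 5 comes from the example in the comment, and how the state would be persisted between commands is deliberately left out. Nothing is sent until the user has explicitly answered yes, and non-interactive runs (the `--noinput` case) never prompt.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class MetricsState:
    """Hypothetical per-user state; a real implementation would persist it."""
    command_count: int = 0
    consent: Optional[bool] = None  # None means the user has not been asked yet


def should_prompt(state: MetricsState, interactive: bool, threshold: int = 5) -> bool:
    """Ask only in interactive runs, only once, and only after `threshold` commands."""
    return interactive and state.consent is None and state.command_count >= threshold


def record_command(state: MetricsState, interactive: bool,
                   ask: Callable[[], bool], threshold: int = 5) -> bool:
    """Count one command run; return True only if metrics may be sent for it."""
    state.command_count += 1
    if should_prompt(state, interactive, threshold):
        state.consent = bool(ask())
    return state.consent is True


if __name__ == "__main__":
    state = MetricsState()
    for run in range(1, 7):
        allowed = record_command(state, interactive=True, ask=lambda: True)
        print(run, allowed)
```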

@jarshwah

jarshwah Dec 3, 2016

Member

I'm in favour of an opt out system, mainly for the reasons in the DEP.

  1. Most people don't care about tracking, and will not change the defaults no matter what.
  2. Those that do care about tracking will specifically disable it.
  3. Some small portion of users (true fans) will enable tracking.

If this were an opt-in system, it would be totally useless. You miss the analytics of the largest group of users, who either don't mind or don't care enough to mind. Opt-in systems don't work. Defaults matter. There will be a small, extremely vocal group of privacy advocates who will swear off Django and call us all sorts of names. We have to weigh up the cost/benefit ratio and decide if that matters. If it does, then I wouldn't even bother with an opt-in system.

MMeent Dec 3, 2016

In my opinion, the user must explicitly consent to the collection of data. This may be through a prompt such as 'Do you agree to send some basic info to Django for statistical purposes?', but the user has to explicitly agree to the sending of data.
You cannot just start collecting data that you (the Django project) use for your own gain without the consent of the user.

one out there.
We've carefully chosen what to send to GA so that even if Google turns evil
they couldn't track Django users. As far as we can tell, the only thing Google

jezdez Nov 7, 2016

Contributor

With respect to you, "if Google turns evil" is a pretty handwavy thing to say. Google has previously been subjected to secret US court orders and was required to collaborate in mass surveillance conducted by US intelligence services, so I think we should acknowledge that if we send data to Google it's to an entity that is under such jurisdiction. So this isn't about "evilness" but simply about the legal framework under which Google works, which a global audience such as the Django users need to understand and acknowledge when using Django while it sends the metrics.

I would argue that for example the strict German (and by extension EU) privacy laws would exclude the automatic opt-in as a lawful option. E.g. there exists the legal obligation to note the use of data analysis tools such as Google Analytics in the "imprint", terms of service and privacy statements (Datenschutzrichtlinie) of websites.

aaugustin Nov 7, 2016

Member

@jezdez Do you know if fully anonymized metrics still fall under privacy protection laws? Perhaps because the IP address is considered personal data? (I'm not sure about the latest developments on whether IP addresses are personal data.)

aaugustin Nov 7, 2016

Member

Also this raises the question of whether the mere possibility of getting the IP address from the network connection is an issue. Since we don't store it, I'm not sure it's a problem.

jezdez Nov 7, 2016

Contributor

IP addresses are PII when they are verifiably linked to a person, e.g. when they are static rather than dynamically assigned as with dial-up and some forms of broadband connections. Even dynamic addresses are linked to a person indirectly, since ISPs store them alongside customer records, especially with recent data retention laws coming into effect. Either way, it's a grey area, where the safest bet for Django and our users is to not store them at all (data reduction principle).

Some more details about this: https://iapp.org/news/a/pii-cookies-and-de-id-shades-of-gray/

aaugustin Nov 8, 2016

Member

The discussion seems to be converging on the use of some sort of proxy, which would allow us to strip IP addresses instead of relying on Google to anonymize them, or to anonymize them ourselves.

If we strip them entirely, we lose the ability to analyze the geographical distribution of Django users.
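
One middle ground between stripping addresses entirely and forwarding them unchanged would be for a proxy to truncate them before passing anything on, which keeps a coarse location. A minimal sketch, assuming a hypothetical helper `anonymize_ip` and assuming that zeroing the last octet of an IPv4 address (or everything past the first 48 bits of an IPv6 address) is coarse enough; neither the function nor those prefix lengths come from the DEP:

```python
import ipaddress


def anonymize_ip(raw_ip, v4_prefix=24, v6_prefix=48):
    """Return raw_ip with its host bits zeroed (hypothetical helper)."""
    address = ipaddress.ip_address(raw_ip)
    # The prefix lengths are assumptions chosen for illustration only.
    prefix = v4_prefix if address.version == 4 else v6_prefix
    # strict=False allows building a network from a host address; the
    # network address is the input with every bit past the prefix cleared.
    network = ipaddress.ip_network(f"{address}/{prefix}", strict=False)
    return str(network.network_address)
```

Whether a truncated address still counts as personal data is a legal question this sketch does not answer.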

rhertzog Nov 7, 2016

Hello, as a Debian packager, I can say that this would likely be disabled by default in Debian.

I would rather suggest something more visible and more intrusive: if ~/.local/django/developer-id does not exist, then startproject/startapp/runserver run a new command, "registerdeveloper", which invites the developer to register in some way. They can choose to share only their existence through a random UUID assigned to them (and stored in the above file), or they can share more if they wish (I'll let you figure out what interests you).

For scripting purposes, you could run "django-admin registerdeveloper --disable" so that you are not bugged with the interactive questions when you don't want them.

Associated with the developer id, there could be a timestamp so that each year the developer is invited to update/expand their entry.
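
To make the idea concrete, here is a rough sketch of how such a file could be created and checked. The function name `load_or_create_developer_id`, the JSON layout, and the one-year constant are assumptions made for illustration; only the file location, the random UUID, and the yearly timestamp come from the suggestion above.

```python
import json
import time
import uuid
from pathlib import Path

ONE_YEAR = 365 * 24 * 60 * 60  # seconds; assumed refresh interval


def load_or_create_developer_id(path, now=None):
    """Return (developer_id, needs_refresh) for the id file at path."""
    now = time.time() if now is None else now
    path = Path(path)
    if path.exists():
        record = json.loads(path.read_text())
    else:
        # uuid4 is purely random, so the id reveals nothing about the machine.
        record = {"id": str(uuid.uuid4()), "registered_at": now}
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(record))
    # After a year the developer would be invited to update their entry.
    needs_refresh = now - record["registered_at"] >= ONE_YEAR
    return record["id"], needs_refresh
```

A caller would pass something like `Path.home() / ".local/django/developer-id"` and run the interactive registration only when the file had to be created or `needs_refresh` is true.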

raphaelm Nov 7, 2016

Just so that this has been mentioned: Besides the privacy issue, we also need to make sure that this call is done in a non-blocking way. It is not acceptable that this makes runserver slower to use when for example developing while tethering over a bad cellular connection (which I often do).
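
One way to meet this requirement would be to hand the report to a daemon thread and never wait for it. The sketch below assumes a hypothetical `send` callable that performs the actual HTTP request (ideally with a short timeout of its own); it is not taken from any reference implementation.

```python
import threading


def report_in_background(send, payload):
    """Run send(payload) on a daemon thread and return immediately."""

    def worker():
        try:
            send(payload)
        except Exception:
            # Metrics are best-effort: a network failure must never
            # surface in the developer's terminal.
            pass

    # daemon=True means a hung request cannot keep the process alive.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread
```

The trade-off is that a short-lived command may exit before the request completes, so some reports would simply be lost on a slow connection.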

adamchainz Nov 7, 2016

Member

The DEP states:

potential sponsors always ask for data

and then:

A major reason that fundraising remains difficult is our inability to measure
the size of the Django community.

and then:

Our goal is to try to measure "unique developers"

However in-between there isn't much discussion as to whether 'community size'
is really what sponsors want to see, or that other metrics than 'unique
developers' have been considered. There are a few bracketed clauses about why
some kinds of metrics aren't useful ("e.g. number of times Django's been
installed"), but it doesn't clearly explain to me why tracking developers has
been settled upon.

One alternative would be to estimate what percentage of the top N websites (by
estimated traffic) use Django, using simple signals such as the presence of
/admin/ or the CSRF token. This is public data by virtue of the site being
online.

In fact, with a few minutes of Googling I can find two websites that already
use such techniques to track usage of Django (plus other tools) -
Siftery tracks 1923 sites and
Builtwith tracks 46,000.

I'd like to see the DEP consider several metrics (including non-invasive
profiling), and then justify its choice of tracking developers.
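
As an illustration of the non-invasive approach, the CSRF token signal mentioned above could be checked roughly as follows. This is only a heuristic sketch: `looks_like_django` and `django_share` are hypothetical names, fetching the pages is left out, and sites that rename the cookie or never render a form on the sampled page would be missed.

```python
def looks_like_django(html, cookie_names=()):
    """Heuristic: True if a response shows a common Django CSRF signal."""
    # Django's {% csrf_token %} tag renders a hidden input with this name.
    if "csrfmiddlewaretoken" in html:
        return True
    # "csrftoken" is Django's default CSRF cookie name.
    return "csrftoken" in cookie_names


def django_share(samples):
    """Fraction of (html, cookie_names) samples that look like Django."""
    if not samples:
        return 0.0
    hits = sum(1 for html, cookies in samples if looks_like_django(html, cookies))
    return hits / len(samples)
```

Such a measure would count sites rather than developers, so it answers a different question from the one the DEP poses.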

- A unique Django analytics user ID, e.g. ``3fa04034-a36b-11e6-acd6-acbc32c6febd``.
This is generated by the Python standard library function ``uuid.uuid1()`` and
stored in ``~/.config/djangoanalytics`` (or equivalent on non-Linux

adamchainz Nov 7, 2016

Member

I work on several Django projects mounted in VMs, and rebuild these VMs every few weeks. With this tracking scheme I would be counted as a new developer once for each project and every time I rebuild any VM. This problem would be compounded with Docker setups where developers might rebuild their containers several times a day.

Rationale
=========
The high-level rationale is explained in the `Abstract`_: gather data that we

willingc Nov 7, 2016

Minor edit: remove extra space between that and we

aaugustin Nov 7, 2016

Member

There's little point in gathering metrics if they're disabled by default. Most people will stick with the default (basically hit enter-enter-enter-enter until Django stops asking questions and move on). Then we're back to the kind of results a community survey would provide. We already know that's insufficient to convince potential sponsors (as explained in the DEP).

Thomas Goirand's comment suggests that vocal activists will do whatever is needed to make downstream distributors strip the metrics code, for philosophical reasons that aren't open for discussion. As far as I can tell, it's not about how Django gathers metrics, it's just about the principle of gathering metrics. How that hurts Free Software isn't a consideration. I'm afraid it's a holy war (dare I say jihad?) and it will be hard to escape.

Django's liberal licensing allows distributors to change the code. Certainly we don't want to get into a trademark fight à la Firefox vs. Iceweasel over this. In my opinion that'll just be another reason to encourage users not to install Django with system packages. We already believe virtualenv/pyvenv is the better option for technical and practical reasons.

It's sad that we'll sound like we're criticising distributors when we have to explain why the metrics are underestimated. I guess that's life.


On a more constructive note, I think that describing more precisely what we want to measure and giving more control to developers who want to trigger metrics in the right circumstances would be a good thing.

rhertzog Nov 7, 2016

@aaugustin There's some truth in what you say, but please don't use big words like jihad. Debian, with its policy, provides security to its users precisely because we bring some third-party review between upstreams and them.

Phoning home is bad enough, but if you opt to use Google, then I'm pretty sure it won't fly. It's a matter of trust... Google will have access to data that they claim they will not store. Django has no way to verify that.

I made a suggestion a few minutes ago, I would like to hear your thoughts on it instead of being considered as a member of an extremist project.

I am particularly interested in the problem of funding free software and I agree with you that the metrics would be useful. But even more useful would be a database of Django developers and Django-using companies. So why not build a service to manage such a database and make it trivial to feed that database from the command line client? Such a service would be useful not only to Django but to the wider free software ecosystem. I would likely (try to) deploy it for Debian too...

adamchainz Nov 7, 2016

Member

a database of Django developers and Django-using companies

N.B. there's actually such a database already at https://www.djangosites.org/ listing 5179 websites, though it's a third-party project for which the code hasn't been updated since 2013.

aaugustin Nov 7, 2016

Member

@rhertzog I appreciate that you're open to discussion and constructive. For the avoidance of doubt, I believe that the majority of Debian Developers are also willing to have a productive discussion. It's less clear to me that the majority can prevail in such discussions on the Internet; that's the risk I'm afraid we have to prepare for.

Regarding your suggestion, I agree that we need to give specific examples of what the user experience looks like. I'm less enthusiastic about the idea of "registering"; you shouldn't have to register anywhere to use Django.

The current proposal says:

Metric collection will be enabled by default, and users will be warned the first time they run django-admin.

Perhaps it could look like this?

$ django-admin startproject foobar
This is the first time you're running django-admin on this system.

django-admin anonymously reports your version of Python and Django when you
run startproject, startapp, or runserver. These metrics allow the Django
Software Foundation to estimate the size of the community and to raise money
from sponsors. For details, see https://.../.

As a small contribution to the DSF, we ask you to accept reporting these
metrics. If you prefer, you can also donate to the DSF here: https://.../.

To disable metrics, set the DJANGO_DISABLE_METRICS environment variable to 1.

Continue? [Y/n]

@rhertzog: is that the sort of "more visible and more intrusive" thing you had in mind?

@jezdez: do you think that would be a sufficient opt-in to alleviate that part of your concerns?

I can't say if that's compatible with what @jacobian has in mind. We'll need his input when he's back from Django: Under the Hood and has caught up with the comments.
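
For what it's worth, the decision behind the mock prompt above could be as small as the sketch below. `metrics_enabled` is a hypothetical name; only the DJANGO_DISABLE_METRICS variable and the default-yes `[Y/n]` behaviour come from the example, and the accepted spellings of "yes" are an assumption.

```python
def metrics_enabled(environ, answer):
    """Decide whether to report, given the environment and the prompt reply."""
    # The environment variable wins over anything typed at the prompt.
    if environ.get("DJANGO_DISABLE_METRICS") == "1":
        return False
    # "[Y/n]" means an empty reply (just pressing Enter) counts as yes.
    return answer.strip().lower() in ("", "y", "yes")
```

In django-admin this would be called with `os.environ` and the reply read from the prompt, with the result persisted so the question is only asked once.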

aaugustin Nov 7, 2016

Member

The other idea -- gathering a list of companies using Django -- was briefly discussed during the Django: Under the Hood sprints but ruled out for privacy reasons.

We don't really want to know who's using Django and we're not looking forward to being responsible for a list of websites that you can hack if you find a zero-day in Django...

aaugustin Dec 5, 2016

Member

When I said "emerging consensus", I was referring to how to gather metrics, not about whether to gather metrics.

aaugustin Dec 5, 2016

Member

Also I'd like to expand the opt-in / opt-out possibilities described by @evildmp above.

We have the following options when Django is about to collect metrics for the first time.

  1. don't tell the user and collect metrics (unless they opted-out through a separate mechanism)
  2. tell the user and collect metrics (unless they opted-out through a separate mechanism)
  3. ask the user, default to yes, and collect metrics unless they said no
  4. ask the user, force them to choose yes or no, and collect metrics if they said yes
  5. ask the user, default to no, and collect metrics if they said yes
  6. tell the user and don't collect metrics (unless they opted-in through a separate mechanism)
  7. don't tell the user and don't collect metrics (unless they opted-in through a separate mechanism)

As far as I can tell:

  • opt-out (options 1 and 2) creates legal concerns in countries with strong privacy protection laws — I don't know first hand but comments in the discussion above say so.
  • opt-in (options 6 and 7) doesn't sound useful.
  • Debian's and Ubuntu's popcons, which use options 5 and 3 respectively, suggest that option 5 won't give enough data. Even though Django plans to collect much less data than Debian does, I don't think that changes the result significantly. The huge difference can only be explained by people who don't care going with the default choice to dismiss the prompt.
  • I find option 4 quite annoying from a UX perspective.

That's why I believe that option 3 is the best choice for Django.

[EDIT] fixed copy/paste error which made option 5 identical to option 3.

evildmp Dec 5, 2016

@aaugustin Thanks for breaking down the list of opt-in/opt-out options like that.

One question I have: options 3-5 only make sense when there is some kind of interaction with the user. In automated deployments, the opportunity for obtaining the user's consent would be lost.

This pushes the scope for the assent mechanism back to something else, for example the Django settings, and there we face more or less the same list of options in a different context, with the additional difficulty that the giving or withholding of assent is much less obvious when it's tucked away in a settings.py.

If using settings is less satisfactory (especially for assent by default), then should we also use interactive prompt opt-in? Then we'd be maintaining two different mechanisms. Which would override the other, and how would the user or developer be sure of their state?

aaugustin Dec 5, 2016

Member

@evildmp

My first thought is that django-admin <command> --no-input could just go with the default, under the assumption that --no-input means that you ask for the default choice, but that may not be considered sufficient opt-in, especially since it didn't have this effect in previous releases. Perhaps it's safer for --no-input to skip metrics collection and ask the next time the command is run interactively.

Also there's the issue of backwards compatibility. AFAICT the commands currently targeted by this proposal are non-interactive. Making them interactive by default (unless --no-input is added) could be a problem for some automation scenarios (and trigger irate feedback).

stebunovd Dec 5, 2016

One more option, not sure if you've considered this. If you need data about Django usage, you could try to partner with companies that already have it. It is quite common for developers to monitor their production environments with tools like Sentry or New Relic. Some people host in environments like Heroku, where it's pretty easy to know their stack. Maybe they wouldn't mind sharing anonymous aggregate stats? We could ask @dcramer, for example.

Of course this won't give us absolute numbers of all Django installations in the world, because some sites are not using any monitoring at all. However, if we look at the data, maybe we could find something useful in it based on relative estimates, like the popularity of Django vs. framework X, and then use open data about framework X to estimate the number of Django installations.

The benefit of this approach: no need to add anything to Django, no need to host our own infrastructure, and no need to ask people to trust anyone else (Google?) besides those they already trust.

adamchainz Dec 5, 2016

Member

@aaugustin os.isatty on sys.stdout is normally a good proxy for whether Python is being invoked interactively (SO), so checking that as well before prompting could be an extra guard against the commands already being used in automated scripts.
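
A minimal sketch of that guard, for illustration only. The helper name `should_prompt` and its `no_input` parameter are hypothetical and not part of Django or of the DEP; the only real API used is `os.isatty` on the stream's file descriptor.

```python
import os
import sys


def should_prompt(no_input=False, stream=None):
    """Return True only when it looks safe to ask an interactive question.

    `should_prompt` and `no_input` are hypothetical names for this sketch.
    """
    if no_input:
        # The caller explicitly asked for a non-interactive run.
        return False
    stream = sys.stdout if stream is None else stream
    try:
        # os.isatty() takes a file descriptor, not a file object.
        return os.isatty(stream.fileno())
    except (AttributeError, OSError, ValueError):
        # Streams without a real descriptor (pipes replaced by buffers,
        # StringIO in tests, ...) are treated as non-interactive.
        return False
```

With a check like this, a command run from cron or a CI job (where stdout is usually not a terminal) would skip the question instead of blocking on input.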

dcramer Dec 5, 2016

We don't have any numbers offhand in Sentry, but these days we probably have accurate enough data to be able to identify many things. If it's something the Django team wants, we would be happy to help, though it's possible I'd ask the Django community to write the draft script for the answers wanted.

rafalp Dec 5, 2016

@evildmp Divio's DjangoCMS showed a message about 3.4 being out in a recent project's admin a week ago.

evildmp Dec 5, 2016

@rafalp It's not django CMS doing that, it's django CMS Admin Style, an optional package.

jo-sm Dec 5, 2016

Sorry to jump into the discussion late. I think, regardless of what kind of tracking happens, tracking using Google is going to be more contentious than tracking by itself, and I would be very much against using any Google Analytics tracking (it would likely cause me to move to another framework). Tracking with another service, either open source or commercial, is more okay with me, and I would inquire with different analytics services, since I'd be surprised if there isn't one that would offer free or reduced rates for an open source software project and initiative.

I also don't like the idea of tracking within the application itself. Tracking manage.py/django-admin usage is one thing, and I am relatively okay with it. But tracking in the application, either through the admin pages or via the runtime (unless it happened only when the runtime starts and that's it), would be problematic because proprietary data could be sent: if the URL of a specific admin page were leaked, it could cause many issues within an organization that doesn't want to leak that info. Having to trust that the data wouldn't be tracked, especially with a service like Google Analytics, would mean that some organizations would not upgrade to a newer version of Django, would leave for another framework, or would not choose it in the first place. I don't particularly care about my pet project leaking data from the admin pages, but a bigger organization would, especially if it's in health or the government.

Finally, will this data actually generate more fundraising opportunities for Django? I can understand the desire for more data, but have (potential) investors specifically stated that not having usage metrics makes them less open to investing? I'm curious because, for a Python developer, Django is one of the "household names": any developer would recognize it immediately, and many have probably used it at some point if they've ever done any web app work. So I'm surprised that the name alone wouldn't be enough for investors where data would. And I'd be doubly surprised if PyPI statistics aren't enough for investors, unless they specifically ask for usage metrics and not an easier-to-obtain number like downloads. In other words, is this a solution to a problem that does exist?

Ian-Foote Dec 6, 2016

@LegoStormtroopr without more detail my answer to many of your questions in the survey is "I don't know".

Lukasa Dec 6, 2016

@LegoStormtroopr ❤️ Thanks for taking a constructive approach here.

apollo13 Dec 6, 2016

Member

@LegoStormtroopr The results look all nice and well (to some extent, I am currently having a hard time grasping why the second diagram has fewer responses -- does Google collect partial answers?), but please provide raw access to the data (of all questions), otherwise the usability of this data is quite limited (to quote British Prime Minister Benjamin Disraeli: "There are three kinds of lies: lies, damned lies, and statistics." -- please don't read too much into that sentence though, I am not saying you are lying, I'd just like to form my own picture from the raw data instead of from diagrams that are somewhat biased toward what you want to show).

I also disagree with the math used to calculate the results; this is a highly optimistic calculation. Furthermore, what you readily classify as "50% overhead" is where it becomes really important. There are legal ramifications to consider, as well as how to run the program and how to do the certifications. Without a clear proposal there, your suggestion is not going to provide a viable alternative.

shaib Dec 6, 2016

Member

apollo13 Dec 6, 2016

Member

keimlink Dec 6, 2016

The collection of metrics could be limited to a specific period of time. After that period, the results would be reviewed to find out whether the metrics actually helped raise new funds. Then it can be decided whether or not to continue collecting the metrics.

If this is something that we will do, the following points should be added to the proposal:

  1. The time after which the effect of collecting metrics on fundraising will be evaluated.
  2. The main criteria that will be used to evaluate the success.
  3. Who decides whether to continue collecting the metrics.

anarcat Dec 6, 2016

aaugustin Dec 7, 2016

Member

@anarcat Of course this isn't a contest. Escalation should be avoided.

I think that aggressive or out-of-place comments should get a firm response. I don't think insults should be answered with insults.

aaugustin Dec 11, 2016

Member

Atom is another example of prior art, which many Django devs may be using already.

See https://github.com/atom/metrics for details.

I don't remember drama about this, perhaps because it had metrics built-in from day one.

wfdd Dec 12, 2016

Atom metrics were made opt-in in a recent release. The first time you run Atom it asks you if you wanna enable metrics. Ironically, your response is logged even if you decide not to. See atom/atom#4966 and atom/atom#12281.

aaugustin Jan 21, 2017

Member

As a reference point, here's how the Google Cloud SDK achieves the same goal:

myk@mYk:/usr/local $ ./google-cloud-sdk/install.sh
Welcome to the Google Cloud SDK!

To help improve the quality of this product, we collect anonymized usage data
and anonymized stacktraces when crashes are encountered; additional information
is available at <https://cloud.google.com/sdk/usage-statistics>. You may choose
to opt out of this collection now (by choosing 'N' at the below prompt), or at
any time in the future by running the following command:

    gcloud config set disable_usage_reporting true

Do you want to help improve the Google Cloud SDK (Y/n)?

nemesisdesign Mar 28, 2017

This kind of feature would be very useful for open-source projects built with Django too.
If this feature could be generalized so that open-source projects could re-use it to collect data, then those projects could forward it to Django. This way we would also be able to know which open-source applications built with Django are the most widely deployed.

We believe that we've struck a balance that lets us gather the data we need
for sustainability while respecting our users' privacy. And, it'll always
be possible for users to disable this metric collection. We're hoping the vast

zachborboa Apr 20, 2017

s/possible/possible and simple/

buddylindsey Apr 21, 2017

I just wanted to leave this here as some further information of how others handle this. I was doing some goofing around with .NET Core and ran their new command line tools. When I went to do a dotnet new to create a new .NET Project I got the following message:

Telemetry
--------------
The .NET Core tools collect usage data in order to improve your experience. The data is anonymous and does not include command-line arguments. The data is collected by Microsoft and shared with the community.
You can opt out of telemetry by setting a DOTNET_CLI_TELEMETRY_OPTOUT environment variable to 1 using your favorite shell.
You can read more about .NET Core tools telemetry @ https://aka.ms/dotnet-cli-telemetry.

Here is a clickable link: https://aka.ms/dotnet-cli-telemetry

.NET Core is open source, so this is an example of an open source project gathering data. When you run it, you are notified that collection is on by default.
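
For comparison, an environment-variable opt-out of this kind could be sketched in Python roughly as follows. This is an illustration only: the variable name `DJANGO_TELEMETRY_OPTOUT` and the function `telemetry_enabled` are hypothetical, modelled on `DOTNET_CLI_TELEMETRY_OPTOUT`, and nothing like them exists in Django or in the DEP.

```python
import os

# Hypothetical variable name, modelled on DOTNET_CLI_TELEMETRY_OPTOUT.
OPT_OUT_VAR = "DJANGO_TELEMETRY_OPTOUT"


def telemetry_enabled(environ=None):
    """Return False when the user has opted out via the environment."""
    environ = os.environ if environ is None else environ
    value = environ.get(OPT_OUT_VAR, "").strip().lower()
    # Accept a few common "truthy" spellings as an opt-out.
    return value not in {"1", "true", "yes"}
```

Note that this mirrors the opt-out model described in the .NET message; an opt-in design would invert the default.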

benjaoming Jun 20, 2018

*cough* GDPR *cough*

jarshwah Jun 20, 2018

Member

@benjaoming this DEP hasn't been updated or commented on in over a year, so I don't think it has the traction to actually go anywhere. However, will GDPR even apply?

I'm not very familiar with it, but doesn't it only apply to businesses operating within the EU? And even if it did apply, I would think the only requirement would be for consent, is that correct?

dcramer Jun 21, 2018

While this isn't legal advice, I did run GDPR for Sentry, so consider these my informed opinions.

It generally doesn't apply unless you're collecting some kind of identifying information (which is how it's also associated with tracking cookies). Even more importantly, these kinds of stats are often opt-in, which would satisfy consent needs under GDPR in the cases where they do contain e.g. contact information. These policies don't usually apply to businesses, but there's a fuzzy connection on whether that's the use of Django (you could ask, just like you could ask for consent).

With that said I don't think we should use this ticket tracker as a debate on GDPR politics, and if it does get implemented, the maintainers should simply ensure privacy controls are present and up to standards.

jacobian Jun 21, 2018

Member

I don't have the energy to pursue this any longer, so as the person who started this I'll go ahead and close it. It's frustrating that even the most mild attempt to collect usage data results in such vitriol, but here we are. I really wish we had better information about who used Django, and how, but I'm just not willing to fight about it.

@jacobian jacobian closed this Jun 21, 2018

benjaoming Jun 21, 2018

Times have changed is what I meant by tossing in GDPR, and as a bystander (reading this for the first time also), I was hoping that the DEP would be closed & rebooted because of that.

Good call @jacobian and amazing spirit about getting this far into the discussion and having so many respectable opinions in one DEP!

A lot of the thoughts and ideas from almost 2 years ago about for instance opt-out will likely not be presented the same way again today, post-GDPR. But the discussion was great and very enlightening. So that could serve to inform another DEP that perhaps is more of a "minimal set of acceptable, useful, GDPR + Debian Policy compliant analytics (opt-in)".

I wouldn't mind starting that work, if people are interested in a more modest approach?
