DEP 8: Gathering Django usage analytics #31

jacobian · 2016-11-05T16:48:06Z

I want to start collecting some basic usage metrics so that it's easier for the DSF to raise money.

TODO before merging to master:

reference implementation
list prior art
clarify the cid bit and why we couldn't track a user even if we knew their id
other things I'm probably forgetting

ericholscher · 2016-11-05T16:59:27Z

draft/0008-gathering-usage-metrics.rst

+Google Analytics vs other platforms/choices
+-------------------------------------------
+
+Using Google Analytics is a trade-off. On the one hand, Google's track record 


Might I suggest running a proxy that sends this data along to GA? That way you can change to an API compatible endpoint in the future, without breaking deployed code. It would require running a proxy on your infra, but that is much less demanding than a full analytics install.

It's something I considered, and then discarded under the same reasoning as not running our own choice: I don't want to increase maintenance burden. That said, there are some good reasons to think about a proxy: the one you mentioned, as well as that it'll let us strip out the IP address which addresses the single remaining GA privacy concern. So might be worth thinking further here.

I'd be much in favor of a proxy of that kind.

The proxy part is really interesting.
It helps in the future.

On the other hand, it increases the maintenance burden as well.

ericholscher · 2016-11-05T17:00:28Z

This Beacon implementation in Sentry is one we've been thinking about adding to Read the Docs, for similar reasons: https://github.com/getsentry/sentry/blob/bfc711ed2579d8588f99170c75d974af3d4c8e96/src/sentry/tasks/beacon.py#L32 -- it's a bit of a different idea, but is good prior art.

Of note, it also allows sending a response that includes a message -- which could be useful for security notices. This is probably out of scope for the Django implementation, but might be another added user benefit of "phoning home" in dev.

alexwlchan

A few very minor spelling/grammar suggestions, but otherwise this seems like a pretty sensible proposal. 👍

alexwlchan · 2016-11-05T16:59:48Z

draft/0008-gathering-usage-metrics.rst

+much easier to approach organizations for funding. As Eghbal writes:
+
+    [W]ithout data about which tools are used, and how much we rely upon them,
+    [it is hard to paint a clear picture of what is underfunded.


Extra square bracket has crept in.

alexwlchan · 2016-11-05T17:00:26Z

draft/0008-gathering-usage-metrics.rst

+
+Analytics will be sent when certain ``django-admin`` commands are run:
+``startproject``, ``startapp``, and ``runserver``. If a settings file
+can be loaded (i.e. for``startapp`` and ``runserver``), analytics will only


Missing space between for and startapp.

alexwlchan · 2016-11-05T17:01:50Z

draft/0008-gathering-usage-metrics.rst

+How will analytics be sent?
+---------------------------
+
+Data is sent to Google Analytics over HTTPs using Python's ``urllib2`` standard


Minor: should this be "HTTPS"?

dstufft · 2016-11-05T17:06:36Z

I mentioned this to @jacobian in IRC but I figured I'd mention it here as well.

While it skews somewhat towards "number of downloads (not installs, downloads)", if there's something that can be added to pip or PyPI to aid in this goal I'm definitely interested in it. Of course we have the same privacy goals there as well, but if there are things that can be added on that front to help Django (and other projects) we can absolutely make something happen there. I've wanted to do it for a while and I've just lacked time.

shaib · 2016-11-05T17:44:30Z

@dstufft Every time you install Debian, it asks you if you want to participate in a "popularity contest" which reports home which packages are installed. I guess we could add something like that to virtualenv. The package in Debian which takes care of this is called popcon.

aaugustin · 2016-11-05T19:37:56Z

draft/0008-gathering-usage-metrics.rst

+  the Google Analytics application name (`aid`_) and application version (`av`_).
+
+- A unique Django analytics user ID, e.g. ``3fa04034-a36b-11e6-acd6-acbc32c6febd``.
+  This is generated by the Python standard library function ``uuid.uuid1()`` and


You could use uuid4 instead to further minimize privacy concerns.
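To illustrate the difference behind this suggestion: `uuid.uuid1()` builds the ID from a timestamp plus a 48-bit node value (normally the host's MAC address), whereas `uuid.uuid4()` uses random bits only. A minimal sketch follows; the helper name `new_metrics_id` is hypothetical and not part of the DEP.

```python
import uuid


def new_metrics_id():
    """Return a random (version 4) UUID string for use as an anonymous ID."""
    return str(uuid.uuid4())


# A version 1 UUID exposes the node field and a timestamp to whoever sees it.
time_based = uuid.uuid1()
print(time_based.version, hex(time_based.node))

# A version 4 UUID carries no host or time information.
print(uuid.UUID(new_metrics_id()).version)
```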

arikfr · 2016-11-05T19:58:57Z

draft/0008-gathering-usage-metrics.rst

+Who has access to analytics data?
+---------------------------------
+
+Access to the Google Analytics dashboard and data will be limited to the 


Is it limited on purpose or because of the tool (GA)?

By design the data isn't supposed to contain any private data, so is there a reason not to share it with the public, assuming it's possible?

Yes: it's a "defense in depth" sort of thing. Privacy is a real concern here, and I want to put as many controls around respecting privacy as possible. If we've somehow got something wrong and there's a way to de-anonymize the raw data, I want to limit the number of people who might be able to do so. Sharing summaries and reports is totally something I hope we'll do, but keeping access to the raw data as restricted as is reasonable seems like a good idea.

shaib · 2016-11-06T07:21:54Z

I think these can work much better if we don't try to do this on our own, but in coordination with other important open-source forces. I would love to find out what Debian and Fedora think of this; if they object, I suspect it could turn out ugly.

For the record, Debian's popcon is opt-in, but the installer makes sure to present the option. I'd feel much better about this if we can do something similar.

raphaelm · 2016-11-06T09:05:15Z

@aaugustin I'm at the sprints today, but I'm not the packager for Debian; I'm not even a Debian developer. You probably confused me with @rhertzog, who is listed on the Debian page together with @lfaraone and @brianmay.

However, I know that Debian has in the past patched this kind of thing out of packages; I just can't remember which packages those were off the top of my head.

aaugustin · 2016-11-06T09:33:19Z

If major distros stripped the analytics, it would be a shame. The packager for Debian, Raphael Michel (@raphaelm), was at the DUTH sprints yesterday. Jacob, if you're still there, perhaps you can talk to him?

The order of magnitude would likely remain correct, though, due to virtualenv and pip being the dominant installation method.

shaib · 2016-11-06T09:35:22Z

@aaugustin the technical differences might be minor, but the major distros' positions would be very important for public reception.

shaib · 2016-11-06T09:42:20Z

Another thought: if we're going there, we should probably be collecting usage of 3rd-parties -- at first I thought "apps", but perhaps more than that.
For apps, we could add a flag in AppConfig, defaulting to "no", which says "I want this app reported in the Django PopCon".
For other packages, this would probably require some setuptools API; I'm imagining some way for a "main" to ask its environment for a list of packages that want to be reported, and a way for a package to specify if it wants to be reported. Django could then use this API to do a real popcon. And I can easily see others using the same -- IPython & Anaconda are the first to come to mind.
@dstufft, thoughts?

aaugustin · 2016-11-06T16:52:19Z

@raphaelm: sorry, I was thinking of Raphael Hertzog. Mea culpa!

rbarrois · 2016-11-06T21:04:26Z

draft/0008-gathering-usage-metrics.rst

+they couldn't track Django users. As far as we can tell, the only thing Google
+could do would be to lie about anonymizing IP addresses, and attempt to match
+users based on their IPs. If we discovered Google was lying about this,
+we'd obviously stop using them immediately.


Unless passing the stats through a DSF-operated proxy, how would you "stop using [Google] immediately"?
By that time, there would be thousands of installed Django projects still sending analytics to Google.

rbarrois · 2016-11-06T21:11:00Z

Could you expand on the rules for sending events w.r.t project × developer combinations?

For a given team project, when should events be sent:

Once for the project, no matter how many developers
Once per developer on the team
As many times as developers start the runserver command?

The last one might raise more privacy issues — it is likely run quite often throughout the development cycle, whereas startapp / startproject are run much less often.

tomdyson · 2016-11-06T22:49:11Z

draft/0008-gathering-usage-metrics.rst

+as much as we want this data, collecting it is simply too invasive.
+
+Another option would be to collect data on the admin usage (e.g. embedding
+Google Analytics directly). A couple of Django projects (Wagtail and Oscar)


Wagtail doesn't embed Google Analytics, and although you haven't said this exactly, some readers could mistakenly conclude that it does. Would you mind removing the "(e.g. embedding Google Analytics directly)" clause to clarify this?

In case it's useful, Wagtail's approach is as follows:

for 'administrator' users (i.e. not standard editors) a non-blocking, client-side request is made to a text file on https://releases.wagtail.io, which is a CloudFront distribution

the text file contains the details of the latest stable version. If the current version is lower than the latest stable version, the administrator is alerted to the possibility of an upgrade

a script runs periodically to parse the CloudFront logs and report previously unseen domains (the referrer of the request)

it's opt-out. Site implementers can disable this behaviour with a documented settings config

We decided against the Google Analytics approach because we thought users would be nervous about enabling the transmission of server-to-server information, and because we only wanted to record the minimum useful usage data.

thomasgoirand · 2016-11-07T10:24:05Z

Hi,
Debian will, for sure, consider this as a privacy breach, and remove the code. If this isn't done, I'll file a bug, as for me, privacy is an important aspect of Debian. I would strongly recommend against hard-wiring this type of code without a fail-safe (i.e. some easy way to disable it), and would prefer to have it disabled by default.

As for my personal opinion about this, I'm also strongly against providing any information to Google. Not only is this a privacy breach, it is also potentially a security problem (i.e. making a query to Google may reveal the version of Django that is running, which can effectively tell attackers what version of Django is running on the disclosed IP address).

Instead of this, why don't you just do a survey, and advertise it in the manage.py command? Any form of advertising for such a survey should be fine, provided that it doesn't breach privacy.

jezdez

I've added some inline comments to specific parts of this proposal.

In short, I'm in favor of collecting usage metrics (and we should use the term "metrics", not "analytics" for terminology and trademark reasons), but we need to be way more defensive to protect the privacy of our users. As such I propose:

either use a proxy to never hit Google Analytics directly and to filter out data that we don't want at all
or build the analysis tooling ourselves (e.g. using Re:dash or Apache Airflow)
opt-out by default with a smarter prompt to enable it when needed
legal review and drafting of privacy statements to cover for the transfer of data to a 3rd party under US jurisdiction
adopt something like Mozilla's Data Privacy Principles: https://www.mozilla.org/en-US/privacy/principles/
make "Datenvermeidung und Datensparsamkeit" (principles of data reduction and data economy) a topic every Django team member understands and practices

jezdez · 2016-11-07T09:29:00Z

draft/0008-gathering-usage-metrics.rst

@@ -0,0 +1,306 @@
+=======================================
+DEP 8: Gathering Django usage analytics


s/analytics/metrics/g

jezdez · 2016-11-07T09:29:36Z

draft/0008-gathering-usage-metrics.rst

+Specification
+=============
+
+Starting in version XXX, Django gathers anonymous user analytics and report


s/anonymous user analytics/anonymous usage metrics/g

jezdez · 2016-11-07T09:34:07Z

draft/0008-gathering-usage-metrics.rst

+---------------------------
+
+Data is sent to Google Analytics over HTTPs using Python's ``urllib2`` standard
+library.


This should only happen via urllib2 on Python >= 2.7.9 or any other version that has the backported TLS cert validation feature. This should also say that the data is sent encrypted to Google Analytics.

I don't expect that feature to land before Django drops support for Python 2 and I don't think it'll be backported to earlier releases.

This should not happen in China at all (Google is blocked, so users will see ugly timeouts).
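The Python version requirement raised at the top of this thread could be expressed as a feature check rather than a version comparison. The sketch below assumes that the presence of `ssl.create_default_context()` (added in Python 2.7.9 and 3.4) is an acceptable signal for "certificate validation is available"; the helper name `can_verify_tls` is hypothetical.

```python
import ssl


def can_verify_tls():
    """Return True if this Python can build a certificate-verifying context."""
    return hasattr(ssl, "create_default_context")


if can_verify_tls():
    context = ssl.create_default_context()
    # The default context demands a valid certificate and a matching hostname.
    print(context.verify_mode == ssl.CERT_REQUIRED, context.check_hostname)
else:
    print("skipping metrics: TLS certificates cannot be verified")
```

A sender would then skip reporting entirely when the check fails, instead of falling back to an unverified connection.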

jezdez · 2016-11-07T09:34:21Z

draft/0008-gathering-usage-metrics.rst

+Access to the Google Analytics dashboard and data will be limited to the 
+following people/groups:
+
+- The DSF President, in their role providing oversight to the DSF.


Not the board in general?

I would suggest adding:

Members of the Django Software Foundation Board, upon application.

jezdez · 2016-11-07T09:35:15Z

draft/0008-gathering-usage-metrics.rst

+
+- Members of the Django Technical Board, upon request.
+
+- Members of the Django Infrastructure Team (so they can maintain the GA 


s/Django Infrastructure Team/Django Ops team/g

jezdez · 2016-11-07T09:36:19Z

draft/0008-gathering-usage-metrics.rst

+
+Users can disable analytics collection in two ways:
+
+1. By setting an environment variable: ``export DJANGO_NO_ANALYTICS=1``.


Double negative, let's use DJANGO_USAGE_METRICS=0

While I agree the double negative is annoying, I think it'd be easier if the API was "set the env var to anything not empty to change the behavior". What about DJANGO_DISABLE_METRICS=...?

@aaugustin The problem with that technique (seen this with pip's env var support) is when people set DJANGO_DISABLE_METRICS=no and expect it to work.

Personally I'd like to see a "de facto" standard emerge for disabling tracking across a number of tools that have begun adding analytics. It'd be far easier for privacy-minded people to set a single env var, and know that a number of tools will honour it, without having to set a new one for each tool. I'm not opposed to a Django-specific one, especially if setting it to a specific value enabled tracking, but I'd be in favour of a general fallback key.

ANALYTICS_DISABLED="whatever"  # force no collection, hopefully other tools will begin honouring too. All values will disable.
DJANGO_NO_ANALYTICS=  # 0 enables analytics, every other value disables.
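The two-variable scheme proposed in the last comment can be sketched as a small lookup function. This assumes the intended spellings are `ANALYTICS_DISABLED` and `DJANGO_NO_ANALYTICS`, that an empty generic variable counts as unset, and that collection is on when neither variable is set (as in the DEP draft). None of this is settled API, and `metrics_enabled` is a hypothetical name.

```python
import os


def metrics_enabled(environ=None):
    """Decide whether metrics may be sent, following the proposal above."""
    environ = os.environ if environ is None else environ
    # Generic switch: any non-empty value disables collection (assumption:
    # an empty string is treated the same as an unset variable).
    if environ.get("ANALYTICS_DISABLED"):
        return False
    # Django-specific switch: "0" enables, every other value disables.
    django_flag = environ.get("DJANGO_NO_ANALYTICS")
    if django_flag is not None:
        return django_flag == "0"
    # Neither variable is set: use the default from the DEP draft.
    return True


print(metrics_enabled({"DJANGO_NO_ANALYTICS": "no"}))
```

Because every value other than `0` disables collection, a user who sets the variable to `no` or `false` still gets the conservative outcome, which addresses the pip-style confusion mentioned earlier in the thread.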

jezdez · 2016-11-07T09:38:21Z

draft/0008-gathering-usage-metrics.rst

+-- which raise some of the same concerns as above. So, we choose to only measure
+upon certain commands that we can feel fairly certain won't be run in production.
+This runs the risk of undercounting, but we think this is the best option.
+


What about commands such as collectstatic that may or may not be used in development but could provide useful information about the usefulness of Django features, especially when deciding the fate of contrib apps?

jezdez · 2016-11-07T09:55:35Z

draft/0008-gathering-usage-metrics.rst

+
+We believe that collecting data by default is the only way we'll get a roughly
+accurate measure of Django's usage.
+


I'm strongly opposed to enabling this by default, even "just for developers".

Doing so would risk undermining an important step in building trust with new Django users and destroying existing trust with old Django users when informing them of the data collection upon the first call of a management command.

While this DEP describes an option to opt out, it's a high enough barrier that only privacy-minded users will take those extra steps to actually disable the collection. Since our goal as an Open Source project should be not to sacrifice the essential right to privacy (philosophically speaking, not politically) of our users, we need to let them decide if (and maybe when) to enable data collection, however anonymized it may be.

Instead what I would suggest is to prompt the user after a few calls to the previously described management commands (and assuming not using --noinput in those calls) whether to enable data collection or not. That prompt should have a short overview of what data is collected and sent to Google Analytics as an example and a link to more elaborate documentation similar to https://www.mozilla.org/en-US/privacy/firefox/.

The prompt should only show up after a few calls to the management commands (e.g. 5) to prevent disturbing "the first impression" of new users -- nothing is more of a downer than being asked to answer prompts when you're deep in a tutorial, learning. The tutorial documentation should be amended to mention the possibility of this prompt.

I'm in favour of an opt out system, mainly for the reasons in the DEP.

Most people don't care about tracking, and will not change the defaults no matter what.

Those that do care about tracking will specifically disable it.

Some small portion of users (true fans) will enable tracking.

If this were to be an opt-in system, it's totally useless. You miss the analytics of the largest group of users who either don't mind, or don't care enough to mind. Opt in systems don't work. Defaults matter. There will be a small, extremely vocal, group of privacy advocates who will swear off Django and call us all sorts of names. We have to weigh up the cost/benefit ratio and decide if that matters. If it does, then I wouldn't even bother with an Opt In system.

In my opinion, the user must explicitly consent to the collection of data. This may be through a prompt of 'Do you agree on sending some basic info to Django for statistical reasons', but the user has to explicitly agree to the sending of data.
You cannot just start collecting data that you (the Django project) use for your own gains without the consent of the user.

jezdez · 2016-11-07T10:10:59Z

draft/0008-gathering-usage-metrics.rst

+one out there.
+
+We've carefully chosen what to send to GA so that even if Google turns evil
+they couldn't track Django users. As far as we can tell, the only thing Google


With respect to you, "if Google turns evil" is a pretty handwavy thing to say. Google has previously been subjected to secret US court orders and was required to collaborate in mass surveillance conducted by US intelligence services, so I think we should acknowledge that if we send data to Google it's to an entity that is under such jurisdiction. So this isn't about "evilness" but simply about the legal framework under which Google works, which a global audience such as the Django users need to understand and acknowledge when using Django while it sends the metrics.

I would argue that for example the strict German (and by extension EU) privacy laws would exclude the automatic opt-in as a lawful option. E.g. there exists the legal obligation to note the use of data analysis tools such as Google Analytics in the "imprint", terms of service and privacy statements (Datenschutzrichtlinie) of websites.

@jezdez Do you know if fully anonymized metrics still fall under privacy protection laws? Perhaps because the IP address is considered personal data? (I'm not sure about the latest developments on whether IP addresses are personal data.)

Also this raises the question of whether the mere possibility of getting the IP address from the network connection is an issue. Since we don't store it, I'm not sure it's a problem.

IP addresses are PII when they are linked to a person verifiably, e.g. they are static and not dynamically set like with dial-up and some forms of broadband connections. And even then since ISPs store IP addresses, especially with recent data retention laws coming into effect, they are linked to customer records and are linked indirectly that way. Either way, it's a grey area, at which the safest bet for Django and our users is to not store them at all (data reduction principle).

Some more details about this: https://iapp.org/news/a/pii-cookies-and-de-id-shades-of-gray/

The discussion seems to be converging on the use of some sort of proxy, which would allow us to strip IP addresses instead of relying on Google to anonymize them, or to anonymize them ourselves.

If we strip them entirely, we lose the ability to analyze the geographical distribution of Django users.
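A middle ground between stripping addresses and forwarding them untouched is for the proxy to truncate each address before passing anything on, which keeps coarse geography while dropping the host part. The sketch below is only illustrative: the prefix lengths (/24 for IPv4, /48 for IPv6) are assumptions, and `anonymize_ip` is a hypothetical helper.

```python
import ipaddress


def anonymize_ip(address):
    """Zero the host part of an address: keep /24 for IPv4, /48 for IPv6."""
    ip = ipaddress.ip_address(address)
    prefix = 24 if ip.version == 4 else 48
    # strict=False lets ip_network() accept an address with host bits set
    # and mask them off for us.
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)


print(anonymize_ip("203.0.113.57"))
print(anonymize_ip("2001:db8:abcd:12::1"))
```

Whether a truncated address still counts as personal data is exactly the grey area discussed above, so this reduces the exposure without settling the legal question.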

rhertzog · 2016-11-07T10:49:04Z

Hello, as a Debian packager, I can say that this would likely be disabled by default in Debian.

I would rather suggest something more visible and more intrusive: if ~/.local/django/developer-id does not exist, then the startproject/startapp/runserver commands run a new command "registerdeveloper" which invites the developer to register in some way. They can just decide to share their existence through a random UUID assigned to them (and stored in the above file) or they can share more if they wish (I'll let you figure out what interests you).

For scripting purposes, you could do "django-admin registerdeveloper --disable" so that you are not bugged with the interactive questions when you don't want them.

Associated with the developer id, there could be a timestamp so that each year the developer is invited to update/expand their entry.
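The file-based part of this suggestion might look roughly like the sketch below. It covers only reading and creating the ID file and leaves out the interactive `registerdeveloper` prompt and the yearly timestamp. The path comes from the comment above; the file format (one UUID per file) and the function name are assumptions.

```python
import uuid
from pathlib import Path

# Path taken from the suggestion above; a single-line file is an assumption.
DEFAULT_ID_FILE = Path.home() / ".local" / "django" / "developer-id"


def load_or_create_developer_id(id_file=DEFAULT_ID_FILE):
    """Return the stored developer ID, creating a random one on first use."""
    id_file = Path(id_file)
    if id_file.exists():
        return id_file.read_text().strip()
    developer_id = str(uuid.uuid4())
    id_file.parent.mkdir(parents=True, exist_ok=True)
    id_file.write_text(developer_id + "\n")
    return developer_id
```

Because the ID lives in the home directory, it is lost whenever a VM or container is rebuilt, so any count of unique developers derived from it would be inflated in those setups.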

raphaelm · 2016-11-07T10:55:38Z

Just so that this has been mentioned: Besides the privacy issue, we also need to make sure that this call is done in a non-blocking way. It is not acceptable that this makes runserver slower to use when for example developing while tethering over a bad cellular connection (which I often do).
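One common way to keep such a call from slowing down the command is to hand it to a daemon thread and ignore failures. This is a minimal sketch under that assumption, not the DEP's reference implementation; `send` stands for a hypothetical callable that performs the actual HTTP request.

```python
import threading


def send_in_background(send, payload, join_timeout=0.0):
    """Run ``send(payload)`` on a daemon thread so the caller never waits on it."""

    def worker():
        try:
            send(payload)
        except Exception:
            pass  # Metrics are best effort; a failure must not break the command.

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    if join_timeout:
        # Optionally give the request a short, bounded chance to finish.
        thread.join(join_timeout)
    return thread
```

A daemon thread is abandoned when the process exits, so short-lived commands such as startproject may drop the report unless a small `join_timeout` is used; that trade-off would need to be decided explicitly.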

adamchainz · 2016-11-07T11:00:51Z

The DEP states:

potential sponsors always ask for data

and then:

A major reason that fundraising remains difficult is our inability to measure
the size of the Django community.

and then:

Our goal is to try to measure "unique developers"

However in-between there isn't much discussion as to whether 'community size'
is really what sponsors want to see, or that other metrics than 'unique
developers' have been considered. There are a few bracketed clauses about why
some kinds of metrics aren't useful ("e.g. number of times Django's been
installed"), but it doesn't clearly explain to me why tracking developers has
been settled upon.

One alternative would be to estimate what percentage of the top N websites (by
estimated traffic) use Django, using simple signals such as the presence of
/admin/ or the CSRF token. This is public data by virtue of the site being
online.
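As a rough illustration of the CSRF-token signal, the check below looks for the hidden form field that Django's `{% csrf_token %}` tag renders. It is a heuristic sketch only: pages without forms will not match, and `looks_like_django` is a hypothetical name.

```python
import re

# Name of the hidden form field rendered by Django's {% csrf_token %} tag.
CSRF_FIELD = re.compile(r"""name=["']csrfmiddlewaretoken["']""")


def looks_like_django(html):
    """Return True if the page contains Django's default CSRF form field."""
    return bool(CSRF_FIELD.search(html))


sample = '<form><input type="hidden" name="csrfmiddlewaretoken" value="abc"></form>'
print(looks_like_django(sample))
```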

In fact, with a few minutes of Googling I can find two websites that already
use such techniques to track usage of Django (plus other tools) -
Siftery tracks 1923 sites and
Builtwith tracks 46,000.

I'd like to see the DEP consider several metrics (including non-invasive
profiling), and then justify its choice of tracking developers.

adamchainz · 2016-11-07T11:04:32Z

draft/0008-gathering-usage-metrics.rst

+
+- A unique Django analytics user ID, e.g. ``3fa04034-a36b-11e6-acd6-acbc32c6febd``.
+  This is generated by the Python standard library function ``uuid.uuid1()`` and
+  stored in ``~/.config/djangoanalytics`` (or equivalent on non-Linux


I work on several Django projects mounted in VMs, and rebuild these VMs every few weeks. With this tracking scheme I would be counted as a new developer once for each project and every time I rebuild any VM. This problem would be compounded with Docker setups where developers might rebuild their containers several times a day.

willingc · 2016-11-07T11:16:51Z

draft/0008-gathering-usage-metrics.rst

+Rationale
+=========
+
+The high-level rationale is explained in the `Abstract`_: gather data that  we


Minor edit: remove extra space between that and we

aaugustin · 2016-11-07T11:17:43Z

There's little point in gathering metrics if they're disabled by default. Most people will stick with the default (basically hit enter-enter-enter-enter until Django stops asking questions and move on). Then we're back to the kind of results a community survey would provide. We already know that's insufficient to convince potential sponsors (as explained in the DEP).

Thomas Goirand's comment suggests that vocal activists will do whatever is needed to make downstream distributors strip the metrics code, for philosophical reasons that aren't open for discussion. As far as I can tell, it's not about how Django gathers metrics, it's just about the principle of gathering metrics. How that hurts Free Software isn't a consideration. I'm afraid it's a holy war (dare I say jihad?) and it will be hard to escape.

Django's liberal licensing allows distributors to change the code. Certainly we don't want to get into a trademark fight à la Firefox vs. Iceweasel over this. In my opinion that'll just be another reason to encourage users not to install Django with system packages. We already believe virtualenv/pyvenv is the better option for technical and practical reasons.

It's sad that we'll sound like we criticise distributors when we have to explain why metrics are underestimated. I guess that's life.

On a more constructive note, I think that describing more precisely what we want to measure and giving more control to developers who want to trigger metrics in the right circumstances would be a good thing.

rhertzog · 2016-11-07T11:33:58Z

@aaugustin There's some truth in what you say, but please don't use big words like jihad. Debian with its policy provides security to its users precisely because we bring some third-party review between upstreams and them.

Phoning home is bad enough, but if you opt to use Google, then I'm pretty sure it won't fly. It's a matter of trust... Google will have access to data that they claim they will not store. Django has no way to verify that.

I made a suggestion a few minutes ago, I would like to hear your thoughts on it instead of being considered as a member of an extremist project.

I am particularly interested in the problem of funding free software and I agree with you that the metrics would be useful. But even more useful would be a database of Django developers and Django-using companies. So why not build a service to manage such a database and make it trivial to feed that database from the command line client? Such a service would be useful not only to Django but to the wider free software ecosystem. I would likely (try to) deploy it for Debian too...

adamchainz · 2016-11-07T11:43:05Z

a database of Django developers and Django-using companies

N.B. there's actually such a database already at https://www.djangosites.org/ listing 5179 websites, though it's a third-party project for which the code hasn't been updated since 2013.

aaugustin · 2016-11-07T12:20:38Z

@rhertzog I appreciate that you're open to discussion and constructive. For the avoidance of doubt, I believe that the majority of Debian Developers are also willing to have a productive discussion. It's less clear to me that the majority can prevail in such discussions on the Internet; that's the risk I'm afraid we have to prepare for.

Regarding your suggestion, I agree that we need to give specific examples of what the user experience looks like. I'm less enthusiastic about the idea of "registering"; you shouldn't have to register anywhere to use Django.

The current proposal says:

Metric collection will be enabled by default, and users will be warned the first time they run django-admin.

Perhaps it could look like this?

$ django-admin startproject foobar
This is the first time you're running django-admin on this system.

django-admin anonymously reports your version of Python and Django when you
run startproject, startapp, or runserver. These metrics allow the Django
Software Foundation to estimate the size of the community and to raise money
from sponsors. For details, see https://.../.

As a small contribution to the DSF, we ask you to accept reporting these
metrics. If you prefer, you can also donate to the DSF here: https://.../.

To disable metrics, set the DJANGO_DISABLE_METRICS environment variable to 1.

Continue? [Y/n]

@rhertzog: is that the sort of "more visible and more intrusive" thing you had in mind?

@jezdez: do you think that would be a sufficient opt-in to alleviate that part of your concerns?

I can't say if that's compatible with what @jacobian has in mind. We'll need his input when he's back from Django: Under the Hood and has caught up with the comments.

aaugustin · 2016-11-07T12:26:02Z

The other idea -- gathering a list of companies using Django -- was briefly discussed during the Django: Under the Hood sprints but ruled out for privacy reasons.

We don't really want to know who's using Django and we're not looking forward to being responsible for a list of websites that you can hack if you find a zero-day in Django...

evildmp · 2016-12-05T15:40:58Z

@aaugustin Thanks for breaking down the list of opt-in/opt-out options like that.

One question I have: options 3-5 only make sense when there is some kind of interaction with the user. In automated deployments, the opportunity for obtaining the user's consent would be lost.

This pushes the scope for the assent mechanism back to something else, for example the Django settings, and there we face more or less the same list of options in a different context, with the additional difficulty that the giving or withholding of assent is much less obvious when it's tucked away in a settings.py.

If using settings is less satisfactory (especially for assent by default), then should we also use interactive prompt opt-in? Then we'd be maintaining two different mechanisms. Which would override the other, and how would the user or developer be sure of their state?

aaugustin · 2016-12-05T16:15:31Z

@evildmp

My first thought is that django-admin <command> --no-input could just go with the default, under the assumption that --no-input means that you ask for the default choice, but that may not be considered sufficient opt-in, especially since it didn't have this effect in previous releases. Perhaps it's safer for --no-input to skip metrics collection and ask the next time the command is run interactively.

Also there's the issue of backwards compatibility. AFAICT the commands currently targeted by this proposal are non-interactive. Making them interactive by default (unless --no-input is added) could be a problem for some automation scenarios (and trigger irate feedback).

stebunovd · 2016-12-05T16:17:42Z

One more option, not sure if you considered this. If you need to get the data about Django usage, you can try to partner with companies which already have the data. It is quite common for developers to monitor their production environments with tools like Sentry or New Relic. Some people host in environments like Heroku, where it's pretty easy to know their stack. Maybe they wouldn't mind sharing anonymous total stats? We could ask @dcramer, for example.

Of course this won't give us absolute numbers of all Django installations in the world, because some sites are not using any monitoring at all. However, if we look at the data maybe we could find something useful in it based on relative estimates, like the popularity of Django vs. framework X, and using open data about framework X get an estimate of Django installations.

The benefit of this approach: no need to add anything to Django, no need to host our own infrastructure, no need to ask people to trust anyone else (Google?) besides those whom they already trusted.

adamchainz · 2016-12-05T16:24:22Z

@aaugustin os.isatty on sys.stdout is normally a good proxy for whether Python is being invoked interactively (SO), so checking that as well before prompting could be an extra step to guard against the commands being used in automated scripts already.
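A sketch of that guard, combined with the --no-input idea from the previous comment. `os.isatty()` takes a file descriptor, so the equivalent `isatty()` method on the stream is used here; `should_prompt` and its parameters are hypothetical names.

```python
import sys


def should_prompt(stream=None, no_input=False):
    """Return True only when it is reasonable to ask an interactive question."""
    stream = sys.stdout if stream is None else stream
    if no_input:
        return False
    # File-like objects expose isatty(); anything without it is treated as
    # non-interactive.
    isatty = getattr(stream, "isatty", None)
    return bool(isatty and isatty())


if should_prompt():
    print("interactive session: the consent prompt could be shown")
```

Since the prompt reads an answer, checking `sys.stdin` in the same way may be at least as relevant as checking `sys.stdout`; which stream to test is a design choice left open here.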

dcramer · 2016-12-05T17:04:30Z

We don't have any numbers offhand in Sentry, but these days we probably have accurate enough data to be able to identify many things. If it's something the Django team wants we would be happy to help, though it's possible I'd ask the Django community to write the draft script for the answers wanted.

rafalp · 2016-12-05T17:32:09Z

@evildmp Divio's DjangoCMS showed a message about 3.4 being out in a recent project's admin a week ago.

evildmp · 2016-12-05T18:22:22Z

@rafalp It's not django CMS doing that, it's django CMS Admin Style, an optional package.

jo-sm · 2016-12-05T23:10:28Z

Sorry to jump in late into the discussion. I think, regardless of what kind of tracking happens, tracking using Google is going to be more contentious than tracking by itself, and I would be very much against using any Google Analytics tracking (it would likely cause me to move to another framework). Tracking with another service, either open source or commercial, is more okay with me, and I would inquire with different analytics services, since I'd be surprised if there isn't one that would offer free/reduced rates for an open source software project and initiative.

I also don't like the idea of tracking within the application itself. Tracking the manage.py/django-admin usage is one thing and I am relatively okay with it, but tracking in the application, either by the admin pages or via the runtime (unless it was only when the runtime starts and that's it), would be problematic because proprietary data could be sent. If the URL of a specific admin page was leaked, it could cause many issues within an organization that doesn't want to leak that info. Having to trust that the data wouldn't be tracked, especially with a service like Google Analytics, would mean that some organizations would either not upgrade to a newer version of Django, would leave for another framework, or would not choose it in the first place. I don't particularly care about my pet project leaking data from the admin pages, but a bigger organization would, especially if it's in health or the government.

Finally, will this data actually generate more fundraising opportunities for Django? I can understand the desire for more data, but have (potential) investors specifically stated that not having usage metrics causes them to be less open to investing? I'm curious because, for a Python developer, Django is one of the "household names": any developer would recognize it immediately, and many have probably used it at some point if they've ever done any web app work. So I'm surprised that the name alone wouldn't be enough for investors where data would. And I'd be doubly surprised that PyPI statistics aren't enough for investors, unless they specifically ask for usage metrics and not an easier-to-obtain number like downloads. In other words, is this a solution to a problem that does exist?

LilyFoote · 2016-12-06T01:10:06Z

@LegoStormtroopr without more detail my answer to many of your questions in the survey is "I don't know".

Lukasa · 2016-12-06T09:17:16Z

@LegoStormtroopr ❤️ Thanks for taking a constructive approach here.

apollo13 · 2016-12-06T10:12:12Z

@LegoStormtroopr The results look all nice and well (to some extent; I am currently having a hard time grasping why the second diagram has fewer responses -- does Google collect partial answers?), but please provide raw access to the data (of all questions), otherwise the usability of this data is quite limited (to quote British Prime Minister Benjamin Disraeli: "There are three kinds of lies: lies, damned lies, and statistics." -- please don't read too much into that sentence though, I am not saying you are lying, I'd just like to form my own picture from the raw data instead of from diagrams which are somewhat biased towards what you want to show).

I also disagree with the math used to calculate the results; this is a highly optimistic calculation. Furthermore, what you easily classify as "50% overhead" is where it becomes really important. There are legal ramifications to consider, as well as how to run the program and how to do the certifications. Without a clear proposal there, your suggestion is not going to provide a viable alternative.

shaib · 2016-12-06T11:01:51Z

@nirgal I wish to apologize for losing some of my temper yesterday.

On Monday 05 December 2016 15:45:03 nirgal wrote: @shaib: paranoid delusion? Is this judgement constructive? "nothing will be reported without your explicit permission." I would like that very much... but several people are talking about an opt-out option...

I grant that this was, indeed, suggested. However, as far as I could see, all the people who started out supporting opt-out have already come around, with one notable exception -- the DEP author. Leaving aside for a second the issue of consensus over any tracking at all, I see a rough consensus forming against opt-out. So yes, I can say with a high level of certainty, nothing will be reported without your explicit permission.

apollo13 · 2016-12-06T11:39:15Z

On Tue, Dec 6, 2016, at 11:42 AM, Samuel Spencer wrote: @apollo13 I was working on that. But I'm out of spoons for this whole thing. Good luck. I'm out.

Sorry to hear that. Is it too much to ask if you could just send me the raw data you gathered? It would be a shame to let it rot somewhere given that it exists.

keimlink · 2016-12-06T12:49:06Z

The collection of metrics could be limited to a specific period of time. After that time, it would be evaluated whether the metrics helped raise new funds. Then it can be decided whether to continue collecting the metrics or not.

If this is something that we will do, the following points should be added to the proposal:

The time after which the effect of collecting metrics on fundraising will be evaluated.
The main criteria that will be used to evaluate the success.
Who decides about continuing to collect the metrics.

anarcat · 2016-12-06T20:25:56Z

On 2016-12-05 04:29:08, Aymeric Augustin wrote: For the sake of clarity, I've been repeatedly called out for arguing strongly in this thread, both privately and publicly. I plan to continue answering aggressive comments with a comparable level of energy.

I believe, on the contrary, that we should be "conservative in what we send and liberal in what we accept". We'd all be blind if we followed an "eye for an eye" attitude. I have said before, elsewhere, that I was impressed by how well this discussion was going, and how it showed the great maturity of the project that people were capable of arguing politely about such a sensitive issue... I would hate to be proven wrong. Let's assume good faith, people, we're all in this together.

aaugustin · 2016-12-07T10:27:43Z

@anarcat Of course this isn't a contest. Escalation should be avoided.

I think that aggressive or out-of-place comments should get a firm response. I don't think insults should be answered with insults.

aaugustin · 2016-12-11T22:11:17Z

Atom is another example of prior art, which many Django devs may be using already.

See https://github.com/atom/metrics for details.

I don't remember drama about this, perhaps because it had metrics built-in from day one.

wfdd · 2016-12-12T00:17:03Z

Atom metrics were made opt-in in a recent release. The first time you run Atom it asks you if you wanna enable metrics. Ironically, your response is logged even if you decide not to. See atom/atom#4966 and atom/atom#12281.

aaugustin · 2017-01-21T10:03:01Z

As a reference point, here's how the Google Cloud SDK achieves the same goal:

myk@mYk:/usr/local $ ./google-cloud-sdk/install.sh
Welcome to the Google Cloud SDK!

To help improve the quality of this product, we collect anonymized usage data
and anonymized stacktraces when crashes are encountered; additional information
is available at <https://cloud.google.com/sdk/usage-statistics>. You may choose
to opt out of this collection now (by choosing 'N' at the below prompt), or at
any time in the future by running the following command:

    gcloud config set disable_usage_reporting true

Do you want to help improve the Google Cloud SDK (Y/n)?

nemesifier · 2017-03-28T08:57:51Z

This kind of feature would be very useful for open-source projects built with Django too.
If this feature could be generalized so that open-source projects could re-use it to collect data, then those projects could forward it to Django. This way we would also be able to know which open-source applications built with Django are the most widely deployed.

zachborboa · 2017-04-20T09:39:09Z

draft/0008-gathering-usage-metrics.rst

+
+We believe that we've struck a balance that lets us gather the data we need
+for sustainability while respecting our users' privacy. And, it'll always
+be possible for users to disable this metric collection. We're hoping the vast


s/possible/possible and simple/

buddylindsey · 2017-04-21T20:13:08Z

I just wanted to leave this here as some further information on how others handle this. I was goofing around with .NET Core and ran their new command line tools. When I ran dotnet new to create a new .NET project, I got the following message:

Telemetry
--------------
The .NET Core tools collect usage data in order to improve your experience. The data is anonymous and does not include command-line arguments. The data is collected by Microsoft and shared with the community.
You can opt out of telemetry by setting a DOTNET_CLI_TELEMETRY_OPTOUT environment variable to 1 using your favorite shell.
You can read more about .NET Core tools telemetry @ https://aka.ms/dotnet-cli-telemetry.

Here is a clickable link: https://aka.ms/dotnet-cli-telemetry

.NET Core is open source, so this is an open source project gathering data. When you run it, you are notified that telemetry is on by default.
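As a rough illustration of the environment-variable opt-out pattern shown in that message, a check along these lines could run before anything is sent. The variable name `DJANGO_NO_ANALYTICS` comes from the DEP draft; the function name and the set of values treated as "not opted out" are assumptions made for this sketch.

```python
import os

# Variable name taken from the DEP draft quoted in this thread.
OPT_OUT_VAR = "DJANGO_NO_ANALYTICS"


def analytics_disabled(environ=None):
    """Return True if the user has opted out via the environment."""
    environ = os.environ if environ is None else environ
    value = environ.get(OPT_OUT_VAR, "").strip().lower()
    # Assumption: unset, empty, "0", "false" and "no" mean "not opted out";
    # any other value (such as "1") disables collection.
    return value not in ("", "0", "false", "no")
```

Passing the environment as an argument keeps the check easy to test without modifying the real process environment.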

benjaoming · 2018-06-20T11:16:45Z

*cough* GDPR *cough*

jarshwah · 2018-06-20T23:36:22Z

@benjaoming this DEP hasn't been updated or commented on in over a year, so I don't think it has the traction to actually go anywhere. However, will GDPR even apply?

I'm not very familiar with it, but doesn't it only apply to businesses operating within the EU? And even if it did apply, I would think the only requirement would be for consent, is that correct?

dcramer · 2018-06-21T00:04:03Z

While this isn't legal advice, I did run GDPR for Sentry, so consider these my informed opinions.

It generally doesn't apply unless you're collecting some kind of identifying information (which is how it's also associated with tracking cookies). Even more importantly, these kinds of stats are often opt-in, which would satisfy consent needs under GDPR in the cases where it does contain e.g. contact information. These policies don't usually apply to businesses, but there's a fuzzy connection on whether that's the use of Django (you could ask, just like you could ask for consent).

With that said I don't think we should use this ticket tracker as a debate on GDPR politics, and if it does get implemented, the maintainers should simply ensure privacy controls are present and up to standards.

jacobian · 2018-06-21T01:29:59Z

I don't have the energy to pursue this any longer, so as the person who started this I'll go ahead and close it. It's frustrating that even the most mild attempt to collect usage data results in such vitriol, but here we are. I really wish we had better information about who used Django, and how, but I'm just not willing to fight about it.

benjaoming · 2018-06-21T10:19:48Z

Times have changed is what I meant by tossing in GDPR, and as a bystander (reading this for the first time also), I was hoping that the DEP would be closed & rebooted because of that.

Good call @jacobian and amazing spirit about getting this far into the discussion and having so many respectable opinions in one DEP!

A lot of the thoughts and ideas from almost 2 years ago about, for instance, opt-out will likely not be presented the same way again today, post-GDPR. But the discussion was great and very enlightening. So that could serve to inform another DEP that perhaps is more of a "minimal set of acceptable, useful, GDPR + Debian Policy compliant analytics (opt-in)".

I wouldn't mind starting that work, if people are interested in a more modest approach?

jacobian added 2 commits November 5, 2016 17:45

5efbb8d: Here, have some worms. They were canned, but I opened it.
4fde38d: h/t homebrew

ericholscher reviewed Nov 5, 2016
alexwlchan reviewed Nov 5, 2016
aaugustin reviewed Nov 5, 2016
arikfr reviewed Nov 5, 2016
rbarrois reviewed Nov 6, 2016
tomdyson reviewed Nov 6, 2016
jezdez suggested changes Nov 7, 2016
adamchainz reviewed Nov 7, 2016
willingc reviewed Nov 7, 2016
zachborboa reviewed Apr 20, 2017

jacobian closed this Jun 21, 2018

timgraham deleted the metrics branch November 29, 2018 01:08

yarikoptic mentioned this pull request Feb 8, 2019: Collect usage statistics for various interfaces datalad/datalad#2906 (Closed)

		@@ -0,0 +1,306 @@
		=======================================
		DEP 8: Gathering Django usage analytics


		- Members of the Django Technical Board, upon request.

		- Members of the Django Infrastructure Team (so they can maintain the GA


		Users can disable analytics collection in two ways:

		1. By setting an environment variable: ``export DJANGO_NO_ANALYTICS=1``.


		We believe that collecting data by default is the only way we'll get a roughly
		accurate measure of Django's usage.
