DEP 8: Gathering Django usage analytics #31

Closed
wants to merge 2 commits into from

@jacobian
Member

jacobian commented Nov 5, 2016

I want to start collecting some basic usage metrics so that it's easier for the DSF to raise money.

TODO before merging to master:

  • reference implementation
  • list prior art
  • clarify the cid bit and why we couldn't track a user even if we knew their id
  • other things I'm probably forgetting

Google Analytics vs other platforms/choices
-------------------------------------------

Using Google Analytics is a trade-off. On the one hand, Google's track record

Might I suggest running a proxy that sends this data along to GA? That way you can change to an API compatible endpoint in the future, without breaking deployed code. It would require running a proxy on your infra, but that is much less demanding than a full analytics install.

Member Author

It's something I considered, and then discarded under the same reasoning as not running our own choice: I don't want to increase maintenance burden. That said, there are some good reasons to think about a proxy: the one you mentioned, as well as that it'll let us strip out the IP address which addresses the single remaining GA privacy concern. So might be worth thinking further here.

I'd be much in favor of a proxy of that kind.

The proxy part is really interesting.
It helps in the future.

On the other hand, it increases the maintenance burden as well.

@ericholscher

ericholscher commented Nov 5, 2016

This Beacon implementation in Sentry is one we've been thinking about adding to Read the Docs, for similar reasons: https://github.com/getsentry/sentry/blob/bfc711ed2579d8588f99170c75d974af3d4c8e96/src/sentry/tasks/beacon.py#L32 -- it's a bit of a different idea, but is good prior art.

Of note, it also allows sending a response that includes a message -- which could be useful for security notices. This is probably out of scope for the Django implementation, but might be another added user benefit of "phoning home" in dev.

Contributor

@alexwlchan alexwlchan left a comment

A few very minor spelling/grammar suggestions, but otherwise this seems like a pretty sensible proposal. 👍

much easier to approach organizations for funding. As Eghbal writes:

[W]ithout data about which tools are used, and how much we rely upon them,
[it is hard to paint a clear picture of what is underfunded.
Contributor

Extra square bracket has crept in.


Analytics will be sent when certain ``django-admin`` commands are run:
``startproject``, ``startapp``, and ``runserver``. If a settings file
can be loaded (i.e. for``startapp`` and ``runserver``), analytics will only
Contributor

Missing space between for and startapp.

How will analytics be sent?
---------------------------

Data is sent to Google Analytics over HTTPs using Python's ``urllib2`` standard
Contributor

Minor: should this be "HTTPS"?

@dstufft
Member

dstufft commented Nov 5, 2016

I mentioned this to @jacobian in IRC but I figured I'd mention it here as well.

While it skews somewhat towards "number of downloads (not installs, downloads)", if there's something that can be added to pip or PyPI to aid in this goal, I'm definitely interested in it. Of course we have the same privacy goals there as well, but if there are things that can be added on that front to help Django (and other projects), we can absolutely make something happen there. I've wanted to do it for a while and I've just lacked time.

@shaib
Member

shaib commented Nov 5, 2016

@dstufft Every time you install Debian, it asks you if you want to participate in a "popularity contest" which reports home which packages are installed. I guess we could add something like that to virtualenv. The package in Debian which takes care of this is called popcon.

the Google Analytics application name (`aid`_) and application version (`av`_).

- A unique Django analytics user ID, e.g. ``3fa04034-a36b-11e6-acd6-acbc32c6febd``.
This is generated by the Python standard library function ``uuid.uuid1()`` and
Member

You could use uuid4 instead to further minimize privacy concerns.
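
To illustrate the concern, here is a minimal sketch using only the standard library `uuid` module. It inspects the example ID quoted from the DEP: a version 1 UUID carries a timestamp and a 48-bit node field (typically derived from the machine's MAC address), whereas a version 4 UUID is random. The helper name `describe` is hypothetical.

```python
import uuid

# The example ID quoted in the DEP text above.
example = uuid.UUID("3fa04034-a36b-11e6-acd6-acbc32c6febd")


def describe(u):
    """Return the UUID version and, for version 1, the embedded node field as hex."""
    node = format(u.node, "012x") if u.version == 1 else None
    return u.version, node


# Version 1: the node field can be read straight back out of the ID.
print(describe(example))

# Version 4: generated from random bits, so there is no host-derived field to report.
print(describe(uuid.uuid4()))
```

The node value printed for the example ID is simply its last 12 hex digits, which is why a random ID reveals less about the machine that generated it.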

Who has access to analytics data?
---------------------------------

Access to the Google Analytics dashboard and data will be limited to the

Is it limited on purpose or because of the tool (GA)?

By design the data isn't supposed to contain any private data, so is there a reason not to share it with the public, assuming it's possible?

Member Author

Yes: it's a "defense in depth" sort of thing. Privacy is a real concern here, and I want to put as many controls around respecting privacy as possible. If we've somehow got something wrong and there's a way to de-anonymize the raw data, I want to limit the number of people who might be able to do so. Sharing summaries and reports is totally something I hope we'll do, but keeping access to the raw data as restricted as is reasonable seems like a good idea.

@shaib
Member

shaib commented Nov 6, 2016

I think these can work much better if we don't try to do this on our own, but in coordination with other important open-source forces. I would love to find out what Debian and Fedora think of this; if they object, I suspect it could turn out ugly.

For the record, Debian's popcon is opt-in, but the installer makes sure to present the option. I'd feel much better about this if we can do something similar.

@raphaelm

raphaelm commented Nov 6, 2016

@aaugustin I'm at the sprints today, but I'm not the packager for Debian; I'm not even a Debian developer. You probably confused me with @rhertzog, who is listed on the Debian page together with @lfaraone and @brianmay.

However, I know that Debian has in the past patched these kinds of things out of packages; I just can't remember which packages those were off the top of my head.

@aaugustin
Member

aaugustin commented Nov 6, 2016

If major distros stripped the analytics, it would be a shame. The packager for Debian, Raphael Michel (@raphaelm), was at the DUTH sprints yesterday. Jacob, if you're still there, perhaps you can talk to him?

The order of magnitude would likely remain correct, though, due to virtualenv and pip being the dominant installation method.

@shaib
Member

shaib commented Nov 6, 2016

@aaugustin the technical differences might be minor, but the major distros' positions would be very important for public reception.

@shaib
Member

shaib commented Nov 6, 2016

Another thought: if we're going there, we should probably be collecting usage of 3rd-parties -- at first I thought "apps", but perhaps more than that.
For apps, we could add a flag in AppConfig, defaulting to "no", which says "I want this app reported in the Django PopCon".
For other packages, this would probably require some setuptools API; I'm imagining some way for a "main" to ask its environment for a list of packages that want to be reported, and a way for a package to specify if it wants to be reported. Django could then use this API to do a real popcon. And I can easily see others using the same -- IPython & Anaconda are the first to come to mind.
@dstufft, thoughts?

@aaugustin
Member

@raphaelm: sorry, I was thinking of Raphael Hertzog. Mea culpa!

they couldn't track Django users. As far as we can tell, the only thing Google
could do would be to lie about anonymizing IP addresses, and attempt to match
users based on their IPs. If we discovered Google was lying about this,
we'd obviously stop using them immediately.

Unless passing the stats through a DSF-operated proxy, how would you "stop using [Google] immediately"?
By that time, there would be thousands of installed Django projects still sending analytics to Google.

@rbarrois

rbarrois commented Nov 6, 2016

Could you expand on the rules for sending events w.r.t project × developer combinations?

For a given team project, when should events be sent:

  • Once for the project, no matter how many developers
  • Once per developer on the team
  • As many times as developers start the runserver command?

The last one might raise more privacy issues — it is likely run quite often throughout the development cycle, whereas startapp / startproject are run much less often.

as much as we want this data, collecting it is simply too invasive.

Another option would be to collect data on the admin usage (e.g. embedding
Google Analytics directly). A couple of Django projects (Wagtail and Oscar)

Wagtail doesn't embed Google Analytics, and although you haven't said this exactly, some readers could mistakenly conclude that it does. Would you mind removing the "(e.g. embedding Google Analytics directly)" clause to clarify this?

In case it's useful, Wagtail's approach is as follows:

  • for 'administrator' users (i.e. not standard editors) a non-blocking, client-side request is made to a text file on https://releases.wagtail.io, which is a CloudFront distribution
  • the text file contains the details of the latest stable version. If the current version is lower than the latest stable version, the administrator is alerted to the possibility of an upgrade
  • a script runs periodically to parse the CloudFront logs and report previously unseen domains (the referrer of the request)
  • it's opt-out. Site implementers can disable this behaviour with a documented settings config

We decided against the Google Analytics approach because we thought users would be nervous about enabling the transmission of server-to-server information, and because we only wanted to record the minimum useful usage data.
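
The comparison step in the approach above can be sketched as follows. This is only an illustration, written in Python for consistency with the rest of this thread even though Wagtail performs the check client-side. It assumes the text file holds a plain dotted numeric version string; the actual file format is not described here, and both function names are hypothetical.

```python
def parse_version(text):
    """Turn a dotted release string such as '1.7.2' into a comparable tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def upgrade_available(current, latest):
    """Return True when the running version is lower than the latest stable version."""
    return parse_version(current) < parse_version(latest)
```

Comparing integer tuples rather than raw strings avoids ordering mistakes such as treating "1.10" as older than "1.9".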

@thomasgoirand

Hi,
Debian will, for sure, consider this a privacy breach and remove the code. If this isn't done, I'll file a bug, as for me, privacy is an important aspect of Debian. I would strongly recommend against hard-wiring this type of code without a fail-safe (i.e. some easy way to disable it), and preferably, have it disabled by default.

As for my personal opinion about this, I'm also strongly against providing any information to Google. Not only is this a privacy breach, it is also potentially a security problem (i.e. making a query to Google may reveal which version of Django is running on the disclosed IP address, which is useful to attackers).

Instead of this, why don't you just do a survey, and advertise it in the manage.py command? Any form of advertising for such a survey should be fine, provided that it doesn't breach privacy.

Contributor

@jezdez jezdez left a comment

I've added some inline comments to specific parts of this proposal.

In short, I'm in favor of collecting usage metrics (and we should use the term "metrics", not "analytics" for terminology and trademark reasons), but we need to be way more defensive to protect the privacy of our users. As such I propose:

  • either use a proxy to never hit Google Analytics directly and to filter out data that we don't want at all
  • or build the analysis tooling ourselves (e.g. using Re:dash or Apache Airflow)
  • off by default (opt-in), with a smarter prompt to enable it when needed
  • legal review and drafting of privacy statements to cover for the transfer of data to a 3rd party under US jurisdiction
  • adopt something like Mozilla's Data Privacy Principles: https://www.mozilla.org/en-US/privacy/principles/
  • make "Datenvermeidung und Datensparsamkeit" (principles of data reduction and data economy) a topic every Django team member understands and practices

@@ -0,0 +1,306 @@
=======================================
DEP 8: Gathering Django usage analytics
Contributor

s/analytics/metrics/g

Specification
=============

Starting in version XXX, Django gathers anonymous user analytics and report
Contributor

s/anonymous user analytics/anonymous usage metrics/g

---------------------------

Data is sent to Google Analytics over HTTPs using Python's ``urllib2`` standard
library.
Contributor

This should only happen via urllib2 on Python >= 2.7.9 or any other version that has the backported TLS cert validation feature. This should also say that the data is sent encrypted to Google Analytics.

Member

I don't expect that feature to land before Django drops support for Python 2 and I don't think it'll be backported to earlier releases.

This should not happen in China at all (Google is blocked, users will see ugly timeouts)

Access to the Google Analytics dashboard and data will be limited to the
following people/groups:

- The DSF President, in their role providing oversight to the DSF.
Contributor

Not the board in general?

I would suggest adding:

Members of the Django Software Foundation Board, upon application.


- Members of the Django Technical Board, upon request.

- Members of the Django Infrastructure Team (so they can maintain the GA
Contributor

s/Django Infrastructure Team/Django Ops team/g


Users can disable analytics collection in two ways:

1. By setting an environment variable: ``export DJANGO_NO_ANALYTICS=1``.
Contributor

Double negative, let's use DJANGO_USAGE_METRICS=0

Member

While I agree the double negative is annoying, I think it'd be easier if the API was "set the env var to anything not empty to change the behavior". What about DJANGO_DISABLE_METRICS=...?

Contributor

@aaugustin The problem with that technique (seen this with pip's env var support) is when people set DJANGO_DISABLE_METRICS=no and expect it to work.

Member

Personally I'd like to see a "de facto" standard emerge for disabling tracking across a number of tools that have begun adding analytics. It'd be far easier for privacy-minded people to set a single env var, and know that a number of tools will honour it, without having to set a new one for each tool. I'm not opposed to a Django-specific one, especially if setting it to a specific value enabled tracking, but I'd be in favour of a general fallback key.

ANALYTICS_DISABLED="whatever" # force no collection, hopefully other tools will begin honouring too. All values will disable.
DJANGO_NO_ANALYTICS= # 0 enables analytics, every other value disables.
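
The two-variable scheme suggested here could be read roughly as in the sketch below. It is an illustration only, not part of the DEP. The variable names are spelled `ANALYTICS_DISABLED` and `DJANGO_NO_ANALYTICS`, the function name `analytics_enabled` is hypothetical, and two details are assumptions the thread does not settle: an empty value counts as "not set", and collection is on when neither variable is set (the DEP's proposed default).

```python
import os


def analytics_enabled(environ=None):
    """Decide whether to collect, following the two-variable scheme above.

    Assumptions (not settled in the discussion): an empty value counts as
    "not set", and collection is on when neither variable is set.
    """
    env = os.environ if environ is None else environ

    # Generic cross-tool switch: any non-empty value disables collection.
    if env.get("ANALYTICS_DISABLED"):
        return False

    # Django-specific switch: "0" enables, any other non-empty value disables.
    value = env.get("DJANGO_NO_ANALYTICS")
    if value:
        return value == "0"

    return True
```

Under this reading a value such as `no` still disables collection, which avoids the surprise mentioned earlier for people who set a "disable" variable to `no` and expect it to take effect.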

-- which raise some of the same concerns as above. So, we choose to only measure
upon certain commands that we can feel fairly certain won't be run in production.
This runs the risk of undercounting, but we think this is the best option.

Contributor

What about commands such as collectstatic that may or may not be used in development but could provide useful information about the usefulness of Django features, especially when deciding the fate of contrib apps?


We believe that collecting data by default is the only way we'll get a roughly
accurate measure of Django's usage.

Contributor

I'm strongly opposed to enabling this by default, even "just for developers".

Doing so would risk undermining an important step in building trust with new Django users, and destroying existing trust with long-time Django users, when they are informed of the data collection upon the first call of a management command.

While this DEP describes an option to opt out, it's a high enough barrier that only privacy-minded users will take the extra steps to actually disable the collection. Since our goal as an Open Source project should be not to sacrifice our users' essential right to privacy (philosophically speaking, not politically), we need to let them decide if (and maybe when) to enable data collection, however anonymized it may be.

Instead what I would suggest is to prompt the user after a few calls to the previously described management commands (and assuming not using --noinput in those calls) whether to enable data collection or not. That prompt should have a short overview of what data is collected and sent to Google Analytics as an example and a link to more elaborate documentation similar to https://www.mozilla.org/en-US/privacy/firefox/.

The prompt should only show up after a few calls to the management commands (e.g. 5) to prevent disturbing "the first impression" of new users -- nothing is more of a downer than being asked to answer prompts when you're deep in a tutorial, learning. The tutorial documentation should be amended to mention the possibility of this prompt.
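
The rule described in this comment can be sketched as a small decision function. This is a hypothetical illustration: the names `should_prompt` and `PROMPT_AFTER_CALLS` are invented here, the threshold comes from the "e.g. 5" above, and it assumes the prompt appears from the fifth call onward and is never repeated once answered.

```python
PROMPT_AFTER_CALLS = 5  # hypothetical threshold, taken from the "e.g. 5" above


def should_prompt(call_count, noinput=False, already_answered=False,
                  threshold=PROMPT_AFTER_CALLS):
    """Return True when the consent prompt should be shown on this call.

    call_count is the number of times the relevant management commands have
    been run so far, including the current run.
    """
    # Never interrupt scripted runs or ask the same question twice.
    if noinput or already_answered:
        return False
    return call_count >= threshold
```

How the call count and the answer are persisted between runs is left open here, as it is in the comment itself.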

Member

I'm in favour of an opt out system, mainly for the reasons in the DEP.

  1. Most people don't care about tracking, and will not change the defaults no matter what.
  2. Those that do care about tracking will specifically disable it.
  3. Some small portion of users (true fans) will enable tracking.

If this were to be an opt-in system, it's totally useless. You miss the analytics of the largest group of users who either don't mind, or don't care enough to mind. Opt in systems don't work. Defaults matter. There will be a small, extremely vocal, group of privacy advocates who will swear off Django and call us all sorts of names. We have to weigh up the cost/benefit ratio and decide if that matters. If it does, then I wouldn't even bother with an Opt In system.

In my opinion, the user must explicitly consent to the collection of data. This may be through a prompt such as 'Do you agree to send some basic info to Django for statistical purposes?', but the user has to explicitly agree to the sending of data.
You cannot just start collecting data that you (the Django project) use for your own gain without the consent of the user.

one out there.

We've carefully chosen what to send to GA so that even if Google turns evil
they couldn't track Django users. As far as we can tell, the only thing Google
Contributor

With respect to you, "if Google turns evil" is a pretty handwavy thing to say. Google has previously been subjected to secret US court orders and was required to collaborate in mass surveillance conducted by US intelligence services, so I think we should acknowledge that if we send data to Google it's to an entity that is under such jurisdiction. So this isn't about "evilness" but simply about the legal framework under which Google works, which a global audience such as the Django users need to understand and acknowledge when using Django while it sends the metrics.

I would argue that for example the strict German (and by extension EU) privacy laws would exclude the automatic opt-in as a lawful option. E.g. there exists the legal obligation to note the use of data analysis tools such as Google Analytics in the "imprint", terms of service and privacy statements (Datenschutzrichtlinie) of websites.

Member

@jezdez Do you know if fully anonymized metrics still fall under privacy protection laws? Perhaps because the IP address is considered personal data? (I'm not sure about the latest developments on whether IP addresses are personal data.)

Member

Also this raises the question of whether the mere possibility of getting the IP address from the network connection is an issue. Since we don't store it, I'm not sure it's a problem.

Contributor

IP addresses are PII when they are linked to a person verifiably, e.g. they are static and not dynamically set like with dial-up and some forms of broadband connections. And even then since ISPs store IP addresses, especially with recent data retention laws coming into effect, they are linked to customer records and are linked indirectly that way. Either way, it's a grey area, at which the safest bet for Django and our users is to not store them at all (data reduction principle).

Some more details about this: https://iapp.org/news/a/pii-cookies-and-de-id-shades-of-gray/

Member

The discussion seems to be converging on the use of some sort of proxy, which would allow us to strip IP addresses instead of relying on Google to anonymize them, or to anonymize them ourselves.

If we strip them entirely, we lose the ability to analyze the geographical distribution of Django users.
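
A middle ground between forwarding and stripping addresses is to truncate them in the proxy before anything is stored or passed on. The sketch below uses the standard library `ipaddress` module; the function name `anonymize_ip` and the prefix lengths (/24 for IPv4, /48 for IPv6) are illustrative assumptions, not values agreed in this discussion.

```python
import ipaddress


def anonymize_ip(address, v4_prefix=24, v6_prefix=48):
    """Zero the host part of an IP address, keeping only a coarse network prefix."""
    ip = ipaddress.ip_address(address)
    prefix = v4_prefix if ip.version == 4 else v6_prefix
    # strict=False lets us build the enclosing network from a host address.
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)
```

A truncated address can still support rough geographical analysis, though how precise that remains depends on the prefix length chosen.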

@rhertzog

rhertzog commented Nov 7, 2016

Hello, as a Debian packager, I can say that this would likely be disabled by default in Debian.

I would rather suggest something more visible and more intrusive: if ~/.local/django/developer-id does not exist, then startproject/startapp/runserver run a new command "registerdeveloper" which invites the developer to register in some way. They can just decide to share their existence through a random uuid assigned to them (and stored in the above file), or they can share more if they wish (I'll let you figure out what interests you).

For scripting purpose, you could do "django-admin registerdeveloper --disable" so that you are not bugged with the interactive questions when you don't want them.

Associated with the developer id, there could be a timestamp so that each year the developer is invited to update/expand their entry.
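
The file-handling part of this suggestion might look like the sketch below. It covers only reading or creating the ID file, not the interactive "registerdeveloper" command or the yearly timestamp. The function name is hypothetical, and using `uuid.uuid4()` for the random ID is an assumption.

```python
import os
import uuid


def load_or_create_developer_id(path):
    """Return the ID stored at `path`, creating a random one on first use."""
    try:
        with open(path) as f:
            stored = f.read().strip()
        if stored:
            return stored
    except FileNotFoundError:
        pass  # first run: fall through and create the file

    # Create the parent directory if needed, then persist a fresh random ID.
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    new_id = str(uuid.uuid4())
    with open(path, "w") as f:
        f.write(new_id + "\n")
    return new_id
```

A caller would pass something like `os.path.expanduser("~/.local/django/developer-id")`, the location named in the comment above.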

@raphaelm

raphaelm commented Nov 7, 2016

Just so that this has been mentioned: Besides the privacy issue, we also need to make sure that this call is done in a non-blocking way. It is not acceptable that this makes runserver slower to use when for example developing while tethering over a bad cellular connection (which I often do).
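
One way to meet this requirement is to hand the report to a daemon thread and ignore any failure. The sketch below is an assumption-laden illustration, not the DEP's implementation: `send_in_background` is a hypothetical name, and `send` stands for whatever function would actually perform the HTTPS request.

```python
import threading


def send_in_background(send, payload):
    """Run `send(payload)` on a daemon thread so the command never waits on it.

    Failures are swallowed: a lost report is preferable to a slow or broken
    management command.
    """
    def worker():
        try:
            send(payload)
        except Exception:
            pass  # deliberately ignored; metrics must never break the command

    # A daemon thread does not keep the process alive at exit.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread
```

The trade-off is that a short-lived command such as startproject may exit before the request completes, so some reports would be lost; a long-running runserver is less affected.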

@adamchainz
Sponsor Member

The DEP states:

potential sponsors always ask for data

and then:

A major reason that fundraising remains difficult is our inability to measure
the size of the Django community.

and then:

Our goal is to try to measure "unique developers"

However in-between there isn't much discussion as to whether 'community size'
is really what sponsors want to see, or that other metrics than 'unique
developers' have been considered. There are a few bracketed clauses about why
some kinds of metrics aren't useful ("e.g. number of times Django's been
installed"), but it doesn't clearly explain to me why tracking developers has
been settled upon.

One alternative would be to estimate what percentage of the top N websites (by
estimated traffic) use Django, using simple signals such as the presence of
/admin/ or the CSRF token. This is public data by virtue of the site being
online.

In fact, with a few minutes of Googling I can find two websites that already
use such techniques to track usage of Django (plus other tools) -
Siftery tracks 1923 sites and
Builtwith tracks 46,000.

I'd like to see the DEP consider several metrics (including non-invasive
profiling), and then justify its choice of tracking developers.


- A unique Django analytics user ID, e.g. ``3fa04034-a36b-11e6-acd6-acbc32c6febd``.
This is generated by the Python standard library function ``uuid.uuid1()`` and
stored in ``~/.config/djangoanalytics`` (or equivalent on non-Linux
Sponsor Member

I work on several Django projects mounted in VMs, and rebuild these VMs every few weeks. With this tracking scheme I would be counted as a new developer once for each project and every time I rebuild any VM. This problem would be compounded with Docker setups where developers might rebuild their containers several times a day.

Rationale
=========

The high-level rationale is explained in the `Abstract`_: gather data that we

Minor edit: remove extra space between that and we

@aaugustin
Member

aaugustin commented Nov 7, 2016

There's little point in gathering metrics if they're disabled by default. Most people will stick with the default (basically hit enter-enter-enter-enter until Django stops asking questions and move on). Then we're back to the kind of results a community survey would provide. We already know that's insufficient to convince potential sponsors (as explained in the DEP).

Thomas Goirand's comment suggests that vocal activists will do whatever needed to make downstream distributors strip the metrics code, for philosophical reasons that aren't open for discussion. As far as I can tell, it's not about how Django gathers metrics, it's just about the principle of gathering metrics. How that hurts Free Software isn't a consideration. I'm afraid it's a holy war (dare I say jihad?) and it will be hard to escape.

Django's liberal licensing allows distributors to change the code. Certainly we don't want to get into a trademark fight à la Firefox vs. Iceweasel over this. In my opinion that'll just be another reason to encourage users not to install Django with system packages. We already believe virtualenv/pyvenv is the better option for technical and practical reasons.

It's sad that we'll sound like we criticise distributors when we have to explain why the metrics are an underestimate. I guess that's life.


On a more constructive note, I think that describing more precisely what we want to measure and giving more control to developers who want to trigger metrics in the right circumstances would be a good thing.

@rhertzog

rhertzog commented Nov 7, 2016

@aaugustin There's some truth in what you say, but please don't use big words like jihad. Debian, with its policy, provides security to its users precisely because we bring some third-party review between upstreams and them.

Phoning home is bad enough, but if you opt to use Google, then I'm pretty sure it won't fly. It's a matter of trust... Google will have access to data that they claim they will not store. Django has no way to verify that.

I made a suggestion a few minutes ago, I would like to hear your thoughts on it instead of being considered as a member of an extremist project.

I am particularly interested in the problem of funding free software and I agree with you that the metrics would be useful. But even more useful would be a database of Django developers and Django-using companies. So why not build a service to manage such a database and make it trivial to feed that database from the command line client? Such a service would be useful not only to Django but to the wider free software ecosystem. I would likely (try to) deploy it for Debian too...

@adamchainz
Sponsor Member

a database of Django developers and Django-using companies

N.B. there's actually such a database already at https://www.djangosites.org/ listing 5179 websites, though it's a third-party project for which the code hasn't been updated since 2013.

@aaugustin
Member

@rhertzog I appreciate that you're open to discussion and constructive. For the avoidance of doubt, I believe that the majority of Debian Developers are also willing to have a productive discussion. It's less clear to me that the majority can prevail in such discussions on the Internet; that's the risk I'm afraid we have to prepare for.

Regarding your suggestion, I agree that we need to give specific examples of what the user experience looks like. I'm less enthusiastic about the idea of "registering"; you shouldn't have to register anywhere to use Django.

The current proposal says:

Metric collection will be enabled by default, and users will be warned the first time they run django-admin.

Perhaps it could look like this?

$ django-admin startproject foobar
This is the first time you're running django-admin on this system.

django-admin anonymously reports your version of Python and Django when you
run startproject, startapp, or runserver. These metrics allow the Django
Software Foundation to estimate the size of the community and to raise money
from sponsors. For details, see https://.../.

As a small contribution to the DSF, we ask you to accept reporting these
metrics. If you prefer, you can also donate to the DSF here: https://.../.

To disable metrics, set the DJANGO_DISABLE_METRICS environment variable to 1.

Continue? [Y/n]
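
As a rough illustration of how a prompt like this could be handled, here is a minimal Python sketch. It is not part of the DEP or of Django: the function names are hypothetical, the DJANGO_DISABLE_METRICS variable is taken from the mock-up above, and it assumes that answering "n" declines metrics rather than aborting the command.

```python
import os

# Variable name taken from the mock-up prompt above; it is a proposal,
# not an existing Django setting.
DISABLE_ENV_VAR = "DJANGO_DISABLE_METRICS"


def metrics_disabled_by_env(environ=None):
    """Return True when the opt-out environment variable is set to "1"."""
    environ = os.environ if environ is None else environ
    return environ.get(DISABLE_ENV_VAR, "").strip() == "1"


def parse_answer(answer):
    """Interpret a reply to a "[Y/n]" prompt; an empty reply means yes."""
    normalized = answer.strip().lower()
    if normalized in ("", "y", "yes"):
        return True
    if normalized in ("n", "no"):
        return False
    return None  # unrecognized reply: the caller should ask again


def ask_for_consent(read_line=input, environ=None):
    """Return True if metrics may be reported, False otherwise.

    Assumption: "n" declines metrics; it does not abort the command.
    """
    if metrics_disabled_by_env(environ):
        return False  # never prompt when the user has already opted out
    while True:
        consent = parse_answer(read_line("Continue? [Y/n] "))
        if consent is not None:
            return consent
```

Passing read_line and environ as parameters keeps the logic testable without a real terminal or a modified process environment.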

@rhertzog: is that the sort of "more visible and more intrusive" thing you had in mind?

@jezdez: do you think that would be a sufficient opt-in to alleviate that part of your concerns?

I can't say if that's compatible with what @jacobian has in mind. We'll need his input when he's back from Django: Under the Hood and has caught up with the comments.

@aaugustin
Copy link
Member

The other idea -- gathering a list of companies using Django -- was briefly discussed during the Django: Under the Hood sprints but ruled out for privacy reasons.

We don't really want to know who's using Django and we're not looking forward to being responsible for a list of websites that you can hack if you find a zero-day in Django...

@evildmp
Copy link

evildmp commented Dec 5, 2016

@aaugustin Thanks for breaking down the list of opt-in/opt-out options like that.

One question I have: options 3-5 only make sense when there is some kind of interaction with the user. In automated deployments, the opportunity for obtaining the user's consent would be lost.

This pushes the scope for the assent mechanism back to something else, for example the Django settings, and there we face more or less the same list of options in a different context, with the additional difficulty that the giving or withholding of assent is much less obvious when it's tucked away in a settings.py.

If using settings is less satisfactory (especially for assent by default), then should we also use interactive prompt opt-in? Then we'd be maintaining two different mechanisms. Which would override the other, and how would the user or developer be sure of their state?

@aaugustin
Copy link
Member

@evildmp

My first thought is that django-admin <command> --no-input could just go with the default, under the assumption that --no-input means that you ask for the default choice, but that may not be considered sufficient opt-in, especially since it didn't have this effect in previous releases. Perhaps it's safer for --no-input to skip metrics collection and ask the next time the command is run interactively.

Also there's the issue of backwards compatibility. AFAICT the commands currently targeted by this proposal are non-interactive. Making them interactive by default (unless --no-input is added) could be a problem for some automation scenarios (and trigger irate feedback).
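
To make the "skip now, ask next time" idea concrete, here is a small Python sketch of the decision. It is only an illustration under assumptions: the function name, the three return values, and the idea of a stored consent value (True, False, or None when the user was never asked) are all hypothetical and not part of the proposal.

```python
def metrics_action(stored_consent, interactive, no_input):
    """Decide what a command should do about metrics.

    stored_consent: True, False, or None when the user was never asked.
    interactive: whether a person is available to answer a prompt.
    no_input: whether --no-input was passed.
    Returns "send", "skip", or "prompt".
    """
    if stored_consent is True:
        return "send"
    if stored_consent is False:
        return "skip"
    # Consent is unknown: only ask when a person can actually answer.
    if interactive and not no_input:
        return "prompt"
    # Skip without recording anything, so the next interactive run still asks.
    return "skip"
```

In this sketch a non-interactive run never changes the stored state, so automation keeps working unchanged and the question is deferred to the first interactive run.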

@stebunovd
Copy link

One more option, not sure if you considered this. If you need to get data about Django usage, you can try to partner with companies which already have the data. It is quite common for developers to monitor their production environments with tools like Sentry or New Relic. Some people are hosting in environments like Heroku, where it's pretty easy to know their stack. Maybe they won't mind sharing anonymous total stats? We could ask @dcramer for example.

Of course this won't give us absolute numbers of all Django installations in the world, because some sites are not using any monitoring at all. However, if we look at the data maybe we could find something useful in it based on relative estimates, like the popularity of Django vs. framework X, and, using open data about framework X, get an estimate of Django installations.

The benefit of this approach: no need to add anything to Django, no need to host our own infrastructure, and no need to ask people to trust anyone else (Google?) besides those whom they already trust.

@adamchainz
Copy link
Sponsor Member

@aaugustin os.isatty on sys.stdout is normally a good proxy for whether Python is being invoked interactively (per Stack Overflow), so checking that as well before prompting could be an extra step to guard against the commands already being used in automated scripts.
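
A minimal sketch of such a check, assuming a hypothetical helper name; it uses the stream's own isatty() method, which for sys.stdout is equivalent to os.isatty(sys.stdout.fileno()):

```python
import sys


def is_interactive(stream=None):
    """Best-effort check that a stream is attached to a terminal."""
    stream = sys.stdout if stream is None else stream
    isatty = getattr(stream, "isatty", None)
    if isatty is None:
        return False  # not a file-like object (or stdout is missing)
    try:
        return bool(isatty())
    except ValueError:
        return False  # for example, the stream has already been closed
```

When output is piped or redirected, as in most automated scripts, this returns False and the prompt can be skipped.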

@dcramer
Copy link
Sponsor

dcramer commented Dec 5, 2016

We don't have any numbers offhand in Sentry, but these days we probably have accurate enough data to be able to identify many things. If it's something the Django team wants we would be happy to help, though it's possible I'd ask the Django community to write the draft script for the answers wanted.

@rafalp
Copy link

rafalp commented Dec 5, 2016

@evildmp Divio's DjangoCMS showed a message about 3.4 being out in a recent project's admin a week ago.

@evildmp
Copy link

evildmp commented Dec 5, 2016

@rafalp It's not django CMS doing that, it's django CMS Admin Style, an optional package.

@jo-sm
Copy link

jo-sm commented Dec 5, 2016

Sorry to jump in late into the discussion. I think, regardless of what kind of tracking happens, tracking using Google is going to be more contentious than tracking by itself, and I would be very much against using any Google Analytics tracking (it would likely cause me to move to another framework). Tracking with another service, either open source or commercial, is more okay with me, and I would inquire with different analytics services since I'd be surprised if there isn't one that would offer free/reduced rates for an open source software project and initiative.

I also don't like the idea of tracking within the application itself. Tracking the manage.py/django-admin usage is one thing and I am relatively okay with it, but tracking in the application, either by the admin pages or via the runtime (unless it was only when the runtime starts and that's it), would be problematic because proprietary data could be sent. If the URL of a specific admin page were leaked, it could cause many issues within an organization that doesn't want to leak that info. Having to trust that the data wouldn't be tracked, especially with a service like Google Analytics, would mean that some organizations would not upgrade to a newer version of Django, would leave for another framework, or would not choose it in the first place. I don't particularly care about my pet project leaking data from the admin pages but a bigger organization would, especially if it's in health care or the government.

Finally, will this data actually generate more fundraising opportunities for Django? I can understand the desire for more data, but have (potential) investors specifically stated that not having usage metrics causes them to be less open to investing? I'm curious because for a Python developer, Django is one of the "household names": any developer would recognize it immediately, and many have probably used it at some point if they've ever done any web app work. So I'm surprised that the name alone wouldn't be enough for investors where data would. And I'd be doubly surprised that PyPI statistics aren't enough for investors, unless they specifically ask for usage metrics and not an easier to obtain number like downloads. In other words, is this a solution to a problem that does exist?

@LilyFoote
Copy link

@LegoStormtroopr without more detail my answer to many of your questions in the survey is "I don't know".

@Lukasa
Copy link

Lukasa commented Dec 6, 2016

@LegoStormtroopr ❤️ Thanks for taking a constructive approach here.

@apollo13
Copy link
Member

apollo13 commented Dec 6, 2016

@LegoStormtroopr The results look all nice and well (to some extent; I am currently having a hard time grasping why the second diagram has fewer responses. Does Google collect partial answers?), but please provide raw access to the data (for all questions); otherwise the usability of this data is quite limited. To quote British Prime Minister Benjamin Disraeli: "There are three kinds of lies: lies, damned lies, and statistics." Please don't read too much into that sentence though; I am not saying you are lying, I'd just like to form my own picture from the raw data instead of diagrams which are somewhat biased toward what you want to show.

I also disagree with the math used to calculate the results; this is a highly optimistic calculation. Furthermore, what you easily classify as "50% overhead" is where it becomes really important. There are legal ramifications to consider, as well as how to run the program and how to do the certifications. Without a clear proposal there, your suggestion is not going to provide a viable alternative.

@shaib
Copy link
Member

shaib commented Dec 6, 2016 via email

@apollo13
Copy link
Member

apollo13 commented Dec 6, 2016 via email

@keimlink
Copy link

keimlink commented Dec 6, 2016

The collection of metrics could be limited to a specific period of time. After that time, it would be evaluated to figure out whether the metrics helped raise new funds. Then it can be decided whether to continue collecting the metrics or not.

If this is something that we will do, the following points should be added to the proposal:

  1. The time after which the effect of collecting metrics on fundraising will be evaluated.
  2. The main criteria that will be used to evaluate the success.
  3. Who decides whether to continue collecting the metrics.

@anarcat
Copy link

anarcat commented Dec 6, 2016 via email

@aaugustin
Copy link
Member

aaugustin commented Dec 7, 2016

@anarcat Of course this isn't a contest. Escalation should be avoided.

I think that aggressive or out-of-place comments should get a firm response. I don't think insults should be answered with insults.

@aaugustin
Copy link
Member

Atom is another example of prior art, which many Django devs may be using already.

See https://github.com/atom/metrics for details.

I don't remember drama about this, perhaps because it had metrics built-in from day one.

@wfdd
Copy link

wfdd commented Dec 12, 2016

Atom metrics were made opt-in in a recent release. The first time you run Atom it asks you if you wanna enable metrics. Ironically, your response is logged even if you decide not to. See atom/atom#4966 and atom/atom#12281.

@aaugustin
Copy link
Member

As a reference point, here's how the Google Cloud SDK achieves the same goal:

myk@mYk:/usr/local $ ./google-cloud-sdk/install.sh
Welcome to the Google Cloud SDK!

To help improve the quality of this product, we collect anonymized usage data
and anonymized stacktraces when crashes are encountered; additional information
is available at <https://cloud.google.com/sdk/usage-statistics>. You may choose
to opt out of this collection now (by choosing 'N' at the below prompt), or at
any time in the future by running the following command:

    gcloud config set disable_usage_reporting true

Do you want to help improve the Google Cloud SDK (Y/n)?

@nemesifier
Copy link

This kind of feature would be very useful for open-source projects built with Django too.
If this feature could be generalized so that open-source projects could re-use it to collect data, then those projects could forward it to Django. This way we would also be able to know which open source applications built with Django are the most widely deployed.


We believe that we've struck a balance that lets us gather the data we need
for sustainability while respecting our users' privacy. And, it'll always
be possible for users to disable this metric collection. We're hoping the vast

s/possible/possible and simple/

@buddylindsey
Copy link

I just wanted to leave this here as some further information of how others handle this. I was doing some goofing around with .NET Core and ran their new command line tools. When I went to do a dotnet new to create a new .NET Project I got the following message:

Telemetry
--------------
The .NET Core tools collect usage data in order to improve your experience. The data is anonymous and does not include command-line arguments. The data is collected by Microsoft and shared with the community.
You can opt out of telemetry by setting a DOTNET_CLI_TELEMETRY_OPTOUT environment variable to 1 using your favorite shell.
You can read more about .NET Core tools telemetry @ https://aka.ms/dotnet-cli-telemetry.

Here is a clickable link: https://aka.ms/dotnet-cli-telemetry

.NET Core is open source so it is an open source project gathering data. By running it you are notified that by default it is running.

@benjaoming
Copy link

benjaoming commented Jun 20, 2018

*cough* GDPR *cough*

@jarshwah
Copy link
Member

@benjaoming this DEP hasn't been updated or commented on in over a year, so I don't think it has the traction to actually go anywhere. However, will GDPR even apply?

I'm not very familiar with it, but doesn't it only apply to businesses operating within the EU? And even if it did apply, I would think the only requirement would be for consent, is that correct?

@dcramer
Copy link
Sponsor

dcramer commented Jun 21, 2018

While this isn't legal advice, I did run GDPR for Sentry, so consider these my informed opinions.

It generally doesn't apply unless you're collecting some kind of identifying information (which is how it's also associated with tracking cookies). Even more importantly, these kinds of stats are often opt-in, which would satisfy consent needs under GDPR in the cases where it does contain e.g. contact information. These policies don't usually apply to businesses, but there's a fuzzy connection on whether that's the use of Django (you could ask, just like you could ask for consent).

With that said I don't think we should use this ticket tracker as a debate on GDPR politics, and if it does get implemented, the maintainers should simply ensure privacy controls are present and up to standards.

@jacobian
Copy link
Member Author

I don't have the energy to pursue this any longer, so as the person who started this I'll go ahead and close it. It's frustrating that even the most mild attempt to collect usage data results in such vitriol, but here we are. I really wish we had better information about who used Django, and how, but I'm just not willing to fight about it.

@jacobian jacobian closed this Jun 21, 2018
@benjaoming
Copy link

Times have changed is what I meant by tossing in GDPR, and as a bystander (reading this for the first time also), I was hoping that the DEP would be closed & rebooted because of that.

Good call @jacobian and amazing spirit about getting this far into the discussion and having so many respectable opinions in one DEP!

A lot of the thoughts and ideas from almost 2 years ago about for instance opt-out will likely not be presented the same way again today, post-GDPR. But the discussion was great and very enlightening. So that could serve to inform another DEP that perhaps is more of a "minimal set of acceptable, useful, GDPR + Debian Policy compliant analytics (opt-in)".

I wouldn't mind starting that work, if people are interested in a more modest approach?
