DEP 8: Gathering Django usage analytics #31
Google Analytics vs other platforms/choices
-------------------------------------------

Using Google Analytics is a trade-off. On the one hand, Google's track record
Might I suggest running a proxy that sends this data along to GA? That way you can change to an API compatible endpoint in the future, without breaking deployed code. It would require running a proxy on your infra, but that is much less demanding than a full analytics install.
It's something I considered, and then discarded under the same reasoning as not running our own choice: I don't want to increase maintenance burden. That said, there are some good reasons to think about a proxy: the one you mentioned, as well as that it'll let us strip out the IP address which addresses the single remaining GA privacy concern. So might be worth thinking further here.
I'd be much in favor of a proxy of that kind.
The proxy part is really interesting.
It helps in the future.
On the other hand, it increases the maintenance burden as well.
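To make the proxy idea concrete, here is a minimal sketch of the filtering step such a proxy could apply before forwarding a hit. It assumes hits are form-encoded Measurement Protocol payloads; `uip`, `ua` and `aip` are existing Measurement Protocol parameter names, while `sanitize_hit` and the decision to drop `uip` and `ua` are purely illustrative and not part of the DEP.

```python
from urllib.parse import parse_qsl, urlencode

# Parameters this hypothetical proxy refuses to forward:
# `uip` is the IP override and `ua` the user agent override.
DROPPED_PARAMS = {"uip", "ua"}


def sanitize_hit(body: str) -> str:
    """Return a form-encoded hit with identifying parameters removed."""
    params = [
        (key, value)
        for key, value in parse_qsl(body)
        if key not in DROPPED_PARAMS and key != "aip"
    ]
    # Always ask the backend to anonymize whatever IP it ends up seeing.
    params.append(("aip", "1"))
    return urlencode(params)


print(sanitize_hit("v=1&t=event&uip=203.0.113.7&ec=startproject"))
```

Because the proxy makes the outbound request itself, the analytics backend would only ever see the proxy's address, and swapping the backend later would not require changes to deployed Django installs.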
This Beacon implementation in Sentry is one we've been thinking about adding to Read the Docs, for similar reasons: https://github.com/getsentry/sentry/blob/bfc711ed2579d8588f99170c75d974af3d4c8e96/src/sentry/tasks/beacon.py#L32 -- it's a bit of a different idea, but is good prior art. Of note, it also allows sending a response that includes a message -- which could be useful for security notices. This is probably out of scope for the Django implementation, but might be another added user benefit of "phoning home" in dev.
A few very minor spelling/grammar suggestions, but otherwise this seems like a pretty sensible proposal. 👍
much easier to approach organizations for funding. As Eghbal writes:

[W]ithout data about which tools are used, and how much we rely upon them,
[it is hard to paint a clear picture of what is underfunded.
Extra square bracket has crept in.
Analytics will be sent when certain ``django-admin`` commands are run:
``startproject``, ``startapp``, and ``runserver``. If a settings file
can be loaded (i.e. for``startapp`` and ``runserver``), analytics will only
Missing space between `for` and `startapp`.
How will analytics be sent?
---------------------------

Data is sent to Google Analytics over HTTPs using Python's ``urllib2`` standard
Minor: should this be "HTTPS"?
I mentioned this to @jacobian in IRC but I figured I'd mention it here as well. While it skews somewhat towards "number of downloads (not installs, downloads)", if there's something that can be added to pip or PyPI to aid in this goal, I'm definitely interested in it. Of course we have the same privacy goals there as well, but if there are things that can be added on that front to help Django (and other projects), we can absolutely make something happen there. I've wanted to do it for a while and I've just lacked time.
@dstufft Every time you install Debian, it asks you if you want to participate in a "popularity contest" which reports back which packages are installed. I guess we could add something like that to virtualenv. The package in Debian which takes care of this is called popcon.
the Google Analytics application name (`aid`_) and application version (`av`_).

- A unique Django analytics user ID, e.g. ``3fa04034-a36b-11e6-acd6-acbc32c6febd``.
  This is generated by the Python standard library function ``uuid.uuid1()`` and
You could use `uuid4` instead to further minimize privacy concerns.
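For context on this suggestion, the sketch below inspects the example ID quoted in the DEP using only the standard `uuid` module. It illustrates why a version 1 UUID carries more information than a version 4 one; it is not code from the proposal.

```python
import uuid
from datetime import datetime, timedelta

# The example ID quoted in the DEP text.
example = uuid.UUID("3fa04034-a36b-11e6-acd6-acbc32c6febd")

# Version 1 UUIDs embed a 48-bit node value (often the network card's
# MAC address) and a creation timestamp.
print(example.version)
print(f"{example.node:012x}")

# The timestamp counts 100-nanosecond intervals since 1582-10-15.
created = datetime(1582, 10, 15) + timedelta(microseconds=example.time // 10)
print(created.year)

# Version 4 UUIDs are random and carry neither host nor time information.
random_id = uuid.uuid4()
print(random_id.version)
```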
Who has access to analytics data?
---------------------------------

Access to the Google Analytics dashboard and data will be limited to the
Is it limited on purpose or because of the tool (GA)?
By design the data isn't supposed to contain any private data, so is there a reason not to share it with the public, assuming it's possible?
Yes: it's a "defense in depth" sort of thing. Privacy is a real concern here, and I want to put as many controls around respecting privacy as possible. If we've somehow got something wrong and there's a way to de-anonymize the raw data, I want to limit the number of people who might be able to do so. Sharing summaries and reports is totally something I hope we'll do, but keeping access to the raw data as restricted as is reasonable seems like a good idea.
I think these can work much better if we don't try to do this on our own, but in coordination with other important open-source forces. I would love to find out what Debian and Fedora think of this; if they object, I suspect it could turn out ugly. For the record, Debian's popcon is opt-in, but the installer makes sure to present the option. I'd feel much better about this if we can do something similar. |
@aaugustin I'm at the sprints today, but I'm not the packager for Debian; I'm not even a Debian developer. You probably confused me with @rhertzog, who is listed on the Debian page together with @lfaraone and @brianmay. However, I know that Debian has in the past patched these kinds of things out of packages; I just can't remember which packages those were off the top of my head.
If major distros stripped the analytics, it would be a shame. The packager for Debian, Raphael Michel (@raphaelm), was at the DUTH sprints yesterday. Jacob, if you're still there, perhaps you can talk to him? The order of magnitude would likely remain correct, though, due to virtualenv and pip being the dominant installation method. |
@aaugustin the technical differences might be minor, but the major distros' positions would be very important for public reception. |
Another thought: if we're going there, we should probably be collecting usage of 3rd-parties -- at first I thought "apps", but perhaps more than that. |
@raphaelm: sorry, I was thinking of Raphael Hertzog. Mea culpa! |
they couldn't track Django users. As far as we can tell, the only thing Google
could do would be to lie about anonymizing IP addresses, and attempt to match
users based on their IPs. If we discovered Google was lying about this,
we'd obviously stop using them immediately.
Unless passing the stats through a DSF-operated proxy, how would you "stop using [Google] immediately"?
By that time, there would be thousands of installed Django projects still sending analytics to Google.
Could you expand on the rules for sending events w.r.t project × developer combinations? For a given team project, when should events be sent:
The last one might raise more privacy issues — it is likely run quite often throughout the development cycle, whereas |
as much as we want this data, collecting it is simply too invasive.

Another option would be to collect data on the admin usage (e.g. embedding
Google Analytics directly). A couple of Django projects (Wagtail and Oscar)
Wagtail doesn't embed Google Analytics, and although you haven't said this exactly, some readers could mistakenly conclude that it does. Would you mind removing the "(e.g. embedding Google Analytics directly)" clause to clarify this?
In case it's useful, Wagtail's approach is as follows:
- for 'administrator' users (i.e. not standard editors) a non-blocking, client-side request is made to a text file on https://releases.wagtail.io, which is a CloudFront distribution
- the text file contains the details of the latest stable version. If the current version is lower than the latest stable version, the administrator is alerted to the possibility of an upgrade
- a script runs periodically to parse the CloudFront logs and report previously unseen domains (the referrer of the request)
- it's opt-out. Site implementers can disable this behaviour with a documented settings config
We decided against the Google Analytics approach because we thought users would be nervous about enabling the transmission of server-to-server information, and because we only wanted to record the minimum useful usage data.
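The version check described in this comment can be sketched roughly as follows. The function names and the assumption that versions are purely numeric dotted strings are illustrative; the actual format of Wagtail's release file is not given here.

```python
def parse_version(text):
    """Turn a dotted numeric version such as '1.7.2' into a comparable tuple."""
    return tuple(int(part) for part in text.strip().split("."))


def upgrade_available(current, latest):
    """Return True when the latest stable version is newer than the current one."""
    return parse_version(current) < parse_version(latest)


# `latest` would come from the fetched text file; here it is hard-coded.
print(upgrade_available("1.6.3", "1.7"))
```

Comparing tuples rather than raw strings avoids ordering `1.10` before `1.9`.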
Hi, as for my personal opinion about this, I'm also strongly against providing any information to Google. Not only is this a privacy breach, but it is also potentially a security problem (i.e. a query to Google may reveal which version of Django is running, which can effectively let attackers know what version of Django is running on the disclosed IP address). Instead of this, why don't you just do a survey and advertise it in the manage.py command? Any form of advertising for such a survey should be fine, provided that it doesn't breach privacy.
I've added some inline comments to specific parts of this proposal.
In short, I'm in favor of collecting usage metrics (and we should use the term "metrics", not "analytics", for terminology and trademark reasons), but we need to be way more defensive to protect the privacy of our users. As such I propose:
- either use a proxy to never hit Google Analytics directly and to filter out data that we don't want at all
- or build the analysis tooling ourselves (e.g. using Re:dash or Apache Airflow)
- opt-out by default with a smarter prompt to enable it when needed
- legal review and drafting of privacy statements to cover for the transfer of data to a 3rd party under US jurisdiction
- adopt something like Mozilla's Data Privacy Principles: https://www.mozilla.org/en-US/privacy/principles/
- make "Datenvermeidung und Datensparsamkeit" (principles of data reduction and data economy) a topic every Django team member understands and practices
@@ -0,0 +1,306 @@
=======================================
DEP 8: Gathering Django usage analytics
s/analytics/metrics/g
Specification
=============

Starting in version XXX, Django gathers anonymous user analytics and report
s/anonymous user analytics/anonymous usage metrics/g
---------------------------

Data is sent to Google Analytics over HTTPs using Python's ``urllib2`` standard
library.
This should only happen via urllib2 on Python >= 2.7.9 or any other version that has the backported TLS cert validation feature. This should also say that the data is sent encrypted to Google Analytics.
I don't expect that feature to land before Django drops support for Python 2 and I don't think it'll be backported to earlier releases.
This should not happen in China at all (Google is blocked, users will see ugly timeouts)
Access to the Google Analytics dashboard and data will be limited to the
following people/groups:

- The DSF President, in their role providing oversight to the DSF.
Not the board in general?
I would suggest adding:
Members of the Django Software Foundation Board, upon application.
- Members of the Django Technical Board, upon request.

- Members of the Django Infrastructure Team (so they can maintain the GA
s/Django Infrastructure Team/Django Ops team/g
Users can disable analytics collection in two ways:

1. By setting an environment variable: ``export DJANGO_NO_ANALYTICS=1``.
Double negative, let's use DJANGO_USAGE_METRICS=0
While I agree the double negative is annoying, I think it'd be easier if the API was "set the env var to anything not empty to change the behavior". What about `DJANGO_DISABLE_METRICS=...`?
@aaugustin The problem with that technique (seen this with pip's env var support) is when people set `DJANGO_DISABLE_METRICS=no` and expect it to work.
Personally I'd like to see a de facto standard emerge for disabling tracking across the tools that have begun adding analytics. It'd be far easier for privacy-minded people to set a single env var and know that a number of tools will honour it, without having to set a new one for each tool. I'm not opposed to a Django-specific one, especially if setting it to a specific value enabled tracking, but I'd be in favour of a general fallback key.
ANALYTICS_DISABLED="whatever"  # force no collection, hopefully other tools will begin honouring too. All values will disable.
DJANGO_NO_ANALYTICS=  # 0 enables analytics, every other value disables.
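The naming debate above comes down to how values such as `0` or `no` are interpreted. The sketch below shows one possible reading, under stated assumptions: a positively named `DJANGO_USAGE_METRICS` switch that accepts common falsey spellings, plus a generic `ANALYTICS_DISABLED` kill switch where any non-empty value disables collection. Both variable names come from suggestions in this thread; the function name, the set of accepted spellings, and the enabled-by-default behaviour are assumptions of this sketch.

```python
import os

# Spellings treated as "off" for the Django-specific switch (an assumption).
FALSEY = {"", "0", "no", "false", "off"}


def metrics_enabled(environ=os.environ) -> bool:
    """Decide whether metrics may be sent, given an environment mapping."""
    # Tool-agnostic kill switch: any non-empty value disables collection.
    if environ.get("ANALYTICS_DISABLED"):
        return False
    value = environ.get("DJANGO_USAGE_METRICS")
    if value is None:
        # Unset: fall back to the default, which this sketch assumes is "on".
        return True
    # Explicit falsey spellings disable; anything else enables.
    return value.strip().lower() not in FALSEY


print(metrics_enabled({"DJANGO_USAGE_METRICS": "no"}))
```

With a positively named variable, `DJANGO_USAGE_METRICS=no` does what the user expects, which addresses the pitfall raised about `DJANGO_DISABLE_METRICS=no`.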
-- which raise some of the same concerns as above. So, we choose to only measure
upon certain commands that we can feel fairly certain won't be run in production.
This runs the risk of undercounting, but we think this is the best option.
What about commands such as `collectstatic` that may or may not be used in development but could provide useful information about the usefulness of Django features, especially when deciding the fate of contrib apps?
We believe that collecting data by default is the only way we'll get a roughly
accurate measure of Django's usage.
I'm strongly opposed to enabling this by default, even "just for developers".
Doing so would jeopardize an important step in building trust with new Django users and destroy existing trust with long-time Django users when they are informed of the data collection upon the first call of a management command.
While this DEP describes an option to opt out, it's a high enough barrier that only privacy-minded users will take those extra steps to actually disable the collection. Since our goal as an Open Source project should be not to sacrifice our users' essential right to privacy (philosophically speaking, not politically), we need to let them decide if (and maybe when) to enable data collection, however anonymized it may be.
Instead, what I would suggest is to prompt the user after a few calls to the previously described management commands (and assuming `--noinput` is not used in those calls) whether to enable data collection or not. That prompt should have a short overview of what data is collected and sent to Google Analytics as an example, and a link to more elaborate documentation similar to https://www.mozilla.org/en-US/privacy/firefox/.
The prompt should only show up after a few calls to the management commands (e.g. 5) to prevent disturbing "the first impression" of new users -- nothing is more of a downer than being asked to answer prompts when you're deep in a tutorial. The tutorial documentation should be amended to mention the possibility of this prompt.
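The deferred prompt described above reduces to a small decision rule. The following sketch is only an illustration: the function name, its parameters, and how the call count and previous answer would be persisted are all assumptions, and the threshold of 5 is the example figure from the comment.

```python
def should_prompt(call_count, interactive, answered, threshold=5):
    """Return True when the user should be asked about enabling metrics.

    call_count:  how many eligible management commands have run so far
    interactive: False when the command was run with --noinput
    answered:    True once the user has already said yes or no
    """
    return interactive and not answered and call_count >= threshold


# A new user on their second command is left alone.
print(should_prompt(call_count=2, interactive=True, answered=False))
# The same user on their fifth interactive command is asked once.
print(should_prompt(call_count=5, interactive=True, answered=False))
```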
I'm in favour of an opt out system, mainly for the reasons in the DEP.
- Most people don't care about tracking, and will not change the defaults no matter what.
- Those that do care about tracking will specifically disable it.
- Some small portion of users (true fans) will enable tracking.
If this were to be an opt-in system, it's totally useless. You miss the analytics of the largest group of users who either don't mind, or don't care enough to mind. Opt in systems don't work. Defaults matter. There will be a small, extremely vocal, group of privacy advocates who will swear off Django and call us all sorts of names. We have to weigh up the cost/benefit ratio and decide if that matters. If it does, then I wouldn't even bother with an Opt In system.
In my opinion, the user must explicitly consent to the collection of data. This may be through a prompt such as 'Do you agree to send some basic info to Django for statistical reasons?', but the user has to explicitly agree to the sending of data.
You cannot just start collecting data that you (the Django project) use for your own gains without the consent of the user.
one out there.

We've carefully chosen what to send to GA so that even if Google turns evil
they couldn't track Django users. As far as we can tell, the only thing Google
With respect to you, "if Google turns evil" is a pretty handwavy thing to say. Google has previously been subjected to secret US court orders and was required to collaborate in mass surveillance conducted by US intelligence services, so I think we should acknowledge that if we send data to Google it's to an entity that is under such jurisdiction. So this isn't about "evilness" but simply about the legal framework under which Google works, which a global audience such as the Django users need to understand and acknowledge when using Django while it sends the metrics.
I would argue that for example the strict German (and by extension EU) privacy laws would exclude the automatic opt-in as a lawful option. E.g. there exists the legal obligation to note the use of data analysis tools such as Google Analytics in the "imprint", terms of service and privacy statements (Datenschutzrichtlinie) of websites.
@jezdez Do you know if fully anonymized metrics still fall under privacy protection laws? Perhaps because the IP address is considered personal data? (I'm not sure about the latest developments on whether IP addresses are personal data.)
This also raises the question of whether the mere possibility of getting the IP address from the network connection is an issue. Since we don't store it, I'm not sure it's a problem.
IP addresses are PII when they are verifiably linked to a person, e.g. when they are static rather than dynamically assigned, as with dial-up and some forms of broadband connections. And even then, since ISPs store IP addresses, especially with recent data retention laws coming into effect, they are linked to customer records and so are linked to a person indirectly. Either way, it's a grey area, so the safest bet for Django and our users is to not store them at all (data reduction principle).
Some more details about this: https://iapp.org/news/a/pii-cookies-and-de-id-shades-of-gray/
The discussion seems to be converging on the use of some sort of proxy, which would allow us to strip IP addresses instead of relying on Google to anonymize them, or to anonymize them ourselves.
If we strip them entirely, we lose the ability to analyze the geographical distribution of Django users.
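A middle ground between forwarding and stripping addresses is to truncate them inside the proxy, which keeps coarse geography while discarding the host part. The sketch below uses the standard `ipaddress` module; the function name and the `/24` and `/48` prefix lengths are illustrative assumptions, not something the DEP or this thread specifies.

```python
import ipaddress


def anonymize_ip(address, v4_prefix=24, v6_prefix=48):
    """Zero the host bits of an address, keeping only its network prefix."""
    ip = ipaddress.ip_address(address)
    prefix = v4_prefix if ip.version == 4 else v6_prefix
    # strict=False lets us pass a host address and get its enclosing network.
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)


print(anonymize_ip("203.0.113.77"))
print(anonymize_ip("2001:db8:abcd:12::1"))
```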
Hello, as a Debian packager, I can say that this would likely be disabled by default in Debian. I would rather suggest something more visible and more intrusive: if ~/.local/django/developer-id does not exist, then startproject/startapp/runserver run a new command "registerdeveloper" which invites the developer to register in some way. They can decide to share only their existence through a random UUID assigned to them (and stored in the above file), or they can share more if they wish (I'll let you figure out what interests you). For scripting purposes, you could run "django-admin registerdeveloper --disable" so that you are not bugged with the interactive questions when you don't want them. Associated with the developer ID, there could be a timestamp so that each year the developer is invited to update/expand their entry.
Just so that this has been mentioned: besides the privacy issue, we also need to make sure that this call is made in a non-blocking way. It is not acceptable for this to make runserver slower to use when, for example, developing while tethered over a bad cellular connection (which I often do).
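One way to meet the non-blocking requirement is to hand the request to a daemon thread so the command continues immediately and a slow or failing network never delays it. This is a minimal sketch with hypothetical names (`send_in_background`, `slow_sender`); a stand-in sender replaces the real HTTP call so the example runs without network access.

```python
import threading


def send_in_background(send, payload):
    """Run send(payload) on a daemon thread and return immediately."""
    def worker():
        try:
            send(payload)
        except Exception:
            # Metrics must never break or delay the management command.
            pass

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread


# Demonstration with a stand-in sender that blocks until released,
# imitating a slow connection.
release = threading.Event()
delivered = []


def slow_sender(payload):
    release.wait(timeout=5)
    delivered.append(payload)


thread = send_in_background(slow_sender, {"command": "runserver"})
still_pending = not delivered  # control returned before delivery finished
release.set()
thread.join(timeout=5)
print(still_pending, delivered)
```

Because the thread is a daemon, a hung request would not keep the process alive at exit; the trade-off is that a hit may be lost if the command finishes first.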
The DEP states:
and then:
and then:
However in-between there isn't much discussion as to whether 'community size' One alternative would be to estimate what percentage of the top N websites (by In fact, with a few minutes of Googling I can find two websites that already I'd like to see the DEP consider several metrics (including non-invasive |
- A unique Django analytics user ID, e.g. ``3fa04034-a36b-11e6-acd6-acbc32c6febd``.
  This is generated by the Python standard library function ``uuid.uuid1()`` and
  stored in ``~/.config/djangoanalytics`` (or equivalent on non-Linux
I work on several Django projects mounted in VMs, and rebuild these VMs every few weeks. With this tracking scheme I would be counted as a new developer once for each project and every time I rebuild any VM. This problem would be compounded with Docker setups where developers might rebuild their containers several times a day.
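The overcounting described here follows directly from storing the ID in a per-machine file. The sketch below illustrates the mechanism under stated assumptions: `load_or_create_id` is a hypothetical helper, and a temporary directory stands in for the home directory so that deleting the file can imitate a rebuilt VM or container.

```python
import tempfile
import uuid
from pathlib import Path


def load_or_create_id(path: Path) -> str:
    """Return the stored ID, creating and saving a new one if none exists."""
    if path.exists():
        return path.read_text().strip()
    new_id = str(uuid.uuid1())  # the generator named in the DEP text
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(new_id)
    return new_id


with tempfile.TemporaryDirectory() as home:
    id_file = Path(home) / ".config" / "djangoanalytics"
    first = load_or_create_id(id_file)
    second = load_or_create_id(id_file)  # same machine: the ID is reused
    id_file.unlink()                     # imitate a rebuilt VM or container
    third = load_or_create_id(id_file)   # now counted as a new developer

print(first == second, first == third)
```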
Rationale
=========

The high-level rationale is explained in the `Abstract`_: gather data that we
Minor edit: remove extra space between `that` and `we`.
There's little point in gathering metrics if they're disabled by default. Most people will stick with the default (basically hit enter-enter-enter-enter until Django stops asking questions and move on). Then we're back to the kind of results a community survey would provide. We already know that's insufficient to convince potential sponsors (as explained in the DEP). Thomas Goirand's comment suggests that vocal activists will do whatever is needed to make downstream distributors strip the metrics code, for philosophical reasons that aren't open for discussion. As far as I can tell, it's not about how Django gathers metrics, it's just about the principle of gathering metrics. How that hurts Free Software isn't a consideration. I'm afraid it's a holy war (dare I say jihad?) and it will be hard to escape. Django's liberal licensing allows distributors to change the code. Certainly we don't want to get into a trademark fight à la Firefox vs. Iceweasel over this. In my opinion that'll just be another reason to encourage users not to install Django with system packages. We already believe virtualenv/pyvenv is the better option for technical and practical reasons. It's sad that we'll sound like we're criticising distributors when we have to explain why metrics are underestimated. I guess that's life. On a more constructive note, I think that describing more precisely what we want to measure and giving more control to developers who want to trigger metrics in the right circumstances would be a good thing.
@aaugustin There's some truth in what you say, but please don't use big words like jihad. Debian, with its policy, provides security to its users precisely because we bring some third-party review between upstreams and them. Phoning home is bad enough, but if you opt to use Google, then I'm pretty sure it won't fly. It's a matter of trust... Google will have access to data that they claim they will not store. Django has no way to verify that. I made a suggestion a few minutes ago; I would like to hear your thoughts on it instead of being considered a member of an extremist project. I am particularly interested in the problem of funding free software and I agree with you that the metrics would be useful. But even more useful would be a database of Django developers and Django-using companies. So why not build a service to manage such a database and make it trivial to feed that database from the command line client? Such a service would be useful not only to Django but to the wider free software ecosystem. I would likely (try to) deploy it for Debian too...
N.B. there's actually such a database already at https://www.djangosites.org/ listing 5179 websites, though it's a third-party project for which the code hasn't been updated since 2013. |
@rhertzog I appreciate that you're open to discussion and constructive. For the avoidance of doubt, I believe that the majority of Debian Developers are also willing to have a productive discussion. It's less clear to me that the majority can prevail in such discussions on the Internet; that's the risk I'm afraid we have to prepare for. Regarding your suggestion, I agree that we need to give specific examples of what the user experience looks like. I'm less enthusiastic about the idea of "registering"; you shouldn't have to register anywhere to use Django. The current proposal says:
Perhaps it could look like this?
@rhertzog: is that the sort of "more visible and more intrusive" thing you had in mind? @jezdez: do you think that would be a sufficient opt-in to alleviate that part of your concerns? I can't say if that's compatible with what @jacobian has in mind. We'll need his input when he's back from Django: Under the Hood and has caught up with the comments. |
The other idea -- gathering a list of companies using Django -- was briefly discussed during the Django: Under the Hood sprints but ruled out for privacy reasons. We don't really want to know who's using Django and we're not looking forward to being responsible for a list of websites that you can hack if you find a zero-day in Django... |
@aaugustin Thanks for breaking down the list of opt-in/opt-out options like that. One question I have: options 3-5 only make sense when there is some kind of interaction with the user. In automated deployments, the opportunity for obtaining the user's consent would be lost. This pushes the scope for the assent mechanism back to something else, for example the Django settings, and there we face more or less the same list of options a different context, with the additional difficulty that the giving or withholding of assent is much less obvious when it's tucked away in a If using settings is less satisfactory (especially for assent by default), then should we also use interactive prompt opt-in? Then we'd be maintaining two different mechanisms. Which would override the other, and how would the user or developer be sure of their state? |
My first thought is that Also there's the issue of backwards compatibility. AFAICT the commands currently targeted by this proposal are non-interactive. Making them interactive by default (unless |
One more option, not sure if you considered this. If you need data about Django usage, you can try to partner with companies which already have it. It is quite common for developers to monitor their production environments with tools like Sentry or New Relic. Some people host in environments like Heroku, where it's pretty easy to know their stack. Maybe they won't mind sharing anonymous aggregate stats? We could ask @dcramer, for example. Of course this won't give us absolute numbers of all Django installations in the world, because some sites are not using any monitoring at all. However, if we look at the data, maybe we could find something useful in it based on relative estimates, like the popularity of Django vs. framework X, and, using open data about framework X, get an estimate of Django installations. The benefit of this approach: no need to add anything to Django, no need to host our own infrastructure, and no need to ask people to trust anyone else (Google?) besides those whom they already trust.
@aaugustin |
We don't have any numbers off hand in Sentry, but these days we probably have accurate enough data to be able to identify many things. If it's something the Django team wants, we would be happy to help, though it's possible I'd ask the Django community to write the draft script for the answers wanted. |
@evildmp Divio's DjangoCMS showed a message about 3.4 being out in a recent project's admin a week ago. |
@rafalp It's not django CMS doing that, it's django CMS Admin Style, an optional package. |
Sorry to jump in late into the discussion. I think, regardless of what kind of tracking happens, tracking using Google is going to be more contentious than tracking by itself, and I would be very much against using any Google Analytics tracking (it would likely cause me to move to another framework). Tracking with another service, either open source or commercial, is more okay with me, and I would inquire with different analytics services, since I'd be surprised if there isn't one that would offer free/reduced rates for an open source software project and initiative. I also don't like the idea of tracking within the application itself. Tracking the Finally, will this data actually generate more fundraising opportunities for Django? I can understand the want for more data, but have (potential) investors specifically stated that not having usage metrics causes them to be less open to investing? I'm curious because for a Python developer, Django is one of the "household names": any developer would recognize it immediately, and many have probably used it at some point if they've ever done any web app work, so I'm surprised that the name alone wouldn't be enough for investors where data would. And I'd be doubly surprised that PyPI statistics aren't enough for investors, unless they specifically ask for usage metrics and not an easier-to-obtain number like downloads. In other words, is this a solution to a problem that does exist? |
@LegoStormtroopr without more detail my answer to many of your questions in the survey is "I don't know". |
@LegoStormtroopr ❤️ Thanks for taking a constructive approach here. |
@LegoStormtroopr The results look all nice and well (to some extent, I am currently having a hard time grasping why the second diagram has fewer responses -- does Google collect partial answers?), but please provide raw access to the data (of all questions), otherwise the usability of this data is quite limited (to quote British Prime Minister Benjamin Disraeli: "There are three kinds of lies: lies, damned lies, and statistics." -- please don't read too much into that sentence though, I am not saying you are lying, I'd just like to make a picture for myself out of raw data instead of diagrams which are somewhat biased to what you want to show). I also disagree with the math used to calculate the results; this is a highly optimistic calculation. Furthermore, what you easily classify as "50% overhead" is where it becomes really important. There are legal ramifications to consider as well as how to run the program and how to do the certifications. Without a clear proposal there, your suggestion is not going to provide a viable alternative. |
@nirgal I wish to apologize for losing some of my temper yesterday.
On Monday 05 December 2016 15:45:03 nirgal wrote:
@shaib: paranoid delusion? Is this judgement constructive? "nothing will be
reported without your explicit permission." I would like that very much...
but several people are talking about an opt-out option...
I grant that this was, indeed, suggested. However, as far as I could see, all
the people who started out supporting opt-out have already come around, with
one notable exception -- the DEP author. Leaving aside for a second the issue
of consensus over any tracking at all, I see a rough consensus forming against
opt-out. So yes, I can say with a high level of certainty, nothing will be
reported without your explicit permission.
|
On Tue, Dec 6, 2016, at 11:42 AM, Samuel Spencer wrote:
@apollo13 I was working on that. But I'm out of spoons for this whole
thing. Good luck. I'm out.
Sorry to hear that, is it too much to ask if you could just send me the
raw data you gathered? Would be a shame to let it rot somewhere given
that it exists.
|
The collection of metrics could be limited to a specific period of time. After that period, the results would be evaluated to see whether the metrics helped raise new funds. Then it can be decided whether to continue collecting the metrics or not. If this is something that we will do, the following points should be added to the proposal:
|
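The time-limited collection suggested above could be enforced in code rather than by policy alone. This is a minimal sketch, assuming a hypothetical cutoff date and a hypothetical `collection_active` helper; the thread does not propose a specific date, and the opt-in requirement here is an assumption carried over from the rest of the discussion.

```python
from datetime import date

# Hypothetical end of the trial period; no date is proposed in the thread.
COLLECTION_ENDS = date(2017, 12, 31)


def collection_active(
    today: date,
    opted_in: bool,
    ends: date = COLLECTION_ENDS,
) -> bool:
    """Report only while the trial is running and the user has opted in."""
    return opted_in and today <= ends
```

With a hard-coded cutoff, installed copies stop reporting on their own once the period ends, even if they are never upgraded. Continuing the collection would then require a new release and a fresh decision.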
On 2016-12-05 04:29:08, Aymeric Augustin wrote:
For the sake of clarity, I've been repeatedly called out for arguing strongly in this thread, both privately and publicly. I plan to continue answering aggressive comments with a comparable level of energy.
I believe, on the contrary, that we should be "conservative in what we
send and liberal in what we accept". We'd all be blind if we follow "an
eye for an eye" attitude.
I have said before, elsewhere, that I was impressed as to how well this
discussion was going, and how it was showing a great maturity of the
project that people were capable of arguing politely such a sensitive
issue... I would hate to be proven wrong.
Let's assume good faith, people, we're all in this together.
|
@anarcat Of course this isn't a contest. Escalation should be avoided. I think that aggressive or out-of-place comments should get a firm response. I don't think insults should be answered with insults. |
Atom is another example of prior art, which many Django devs may be using already. See https://github.com/atom/metrics for details. I don't remember drama about this, perhaps because it had metrics built-in from day one. |
Atom metrics were made opt-in in a recent release. The first time you run Atom it asks you if you wanna enable metrics. Ironically, your response is logged even if you decide not to. See atom/atom#4966 and atom/atom#12281. |
As a reference point, here's how the Google Cloud SDK achieves the same goal:
|
This kind of feature would be very useful for open-source projects built with django too. |
|
We believe that we've struck a balance that lets us gather the data we need
for sustainability while respecting our users' privacy. And, it'll always
be possible for users to disable this metric collection. We're hoping the vast
s/possible/possible and simple/
I just wanted to leave this here as some further information of how others handle this. I was doing some goofing around with .NET Core and ran their new command line tools. When I went to do a
Here is a clickable link: https://aka.ms/dotnet-cli-telemetry .NET Core is open source so it is an open source project gathering data. By running it you are notified that by default it is running. |
*cough* GDPR *cough* |
@benjaoming this DEP hasn't been updated or commented on in over a year, so I don't think it has the traction to actually go anywhere. However, will GDPR even apply? I'm not very familiar with it, but doesn't it only apply to businesses operating within the EU? And even if it did apply, I would think the only requirement would be for consent, is that correct? |
While this isn't legal advice, I did run GDPR for Sentry, so consider these my informed opinions. It generally doesn't apply unless you're collecting some kind of identifying information (which is how it's also associated with tracking cookies). Even more importantly, these kinds of stats are often opt-in, which would satisfy consent needs under GDPR in the cases that it does contain e.g. contact information. These policies don't usually apply to businesses, but there's a fuzzy connection on if that's the use of Django (you could ask, just like you could ask for consent). With that said I don't think we should use this ticket tracker as a debate on GDPR politics, and if it does get implemented, the maintainers should simply ensure privacy controls are present and up to standards. |
I don't have the energy to pursue this any longer, so as the person who started this I'll go ahead and close it. It's frustrating that even the most mild attempt to collect usage data results in such vitriol, but here we are. I really wish we had better information about who used Django, and how, but I'm just not willing to fight about it. |
Times have changed is what I meant by tossing in GDPR, and as a bystander (reading this for the first time also), I was hoping that the DEP would be closed & rebooted because of that. Good call @jacobian and amazing spirit about getting this far into the discussion and having so many respectable opinions in one DEP! A lot of the thoughts and ideas from almost 2 years ago about for instance opt-out will likely not be presented the same way again today, post-GDPR. But the discussion was great and very enlightening. So that could serve to inform another DEP that perhaps is more of a "minimal set of acceptable, useful, GDPR + Debian Policy compliant analytics (opt-in)". I wouldn't mind starting that work, if people are interested in a more modest approach? |
I want to start collecting some basic usage metrics so that it's easier for the DSF to raise money.
TODO before merging to master:
cid
bit and why we couldn't track a user even if we knew their id