Telemetry data is NOT anonymous #1179

Closed
vdmkenny opened this issue Feb 19, 2024 · 12 comments

@vdmkenny
Contributor

vdmkenny commented Feb 19, 2024

From the documentation:

https://github.com/diggerhq/digger/blob/develop/docs/reference/digger.yml.mdx#L102

collect_usage_data | boolean | true | no | allows collecting anonymised usage and debugging data | This flag is only supported in Digger enterprise

However, not even an attempt is made on the client-side to anonymise data. In the below example, your GITHUB_ACTOR environment variable is sent to the analytics server.

func gitHubCI(lock core_locking.Lock, policyChecker core_policy.Checker, backendApi core_backend.Api, reportingStrategy reporting.ReportStrategy) {
	log.Printf("Using GitHub.\n")
	// The raw GITHUB_ACTOR value is passed on as the user ID, unhashed.
	githubActor := os.Getenv("GITHUB_ACTOR")
	if githubActor != "" {
		usage.SendUsageRecord(githubActor, "log", "initialize")
	} else {
		usage.SendUsageRecord("", "log", "non github initialisation")
	}
	// (excerpt ends here; the rest of gitHubCI is not shown)

func SendUsageRecord(repoOwner string, eventName string, action string) error {
	payload := UsageRecord{
		UserId:    repoOwner,
		EventName: eventName,
		Action:    action,
		Token:     "diggerABC@@1998fE",
	}
	return sendPayload(payload)
}

GitHub user and repository names are sent, which is certainly non-trivial information.

Compounded with the fact that the opt-out flag was removed in #1164 (starting with release 0.4.1), this makes this software very problematic to use, in my opinion.

In my opinion the argument made in #1154 that you should pay for the Enterprise Edition if you want the privilege of privacy holds no water.
This is very much a privacy anti-pattern, and I wonder what the GDPR implications of this are.

Can you point me to anything resembling a Data Processing Agreement?

In any case, re-introducing the flag collect_usage_data would be a step in the right direction, but a serious attempt at anonymisation or tokenisation should be made, taking into account the common misconceptions.
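As a minimal sketch of the tokenisation idea, assuming a random per-installation secret that never leaves the client: a keyed hash (HMAC-SHA256) turns a username into a stable token that can still group events. GitHub usernames are public and easy to enumerate, so an unkeyed hash of a username can be reversed by hashing candidate names and comparing; the key is there to make that harder. The function name `Tokenize` and the example key are hypothetical and not taken from the Digger codebase.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Tokenize maps a user identifier to a stable pseudonymous token using
// HMAC-SHA256. Assumption: key is a random secret generated per
// installation and never sent to the analytics server.
func Tokenize(userID string, key []byte) string {
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(userID)) // hash.Hash.Write never returns an error
	return hex.EncodeToString(mac.Sum(nil))
}

func main() {
	// Illustrative key only; a real key should come from a secure random source.
	key := []byte("example-key-for-illustration-only")

	// The same input and key always yield the same token, so events
	// from one user can be grouped without revealing the username.
	fmt.Println(Tokenize("octocat", key))
}
```

With this approach the server sees only the token. Whether the result counts as anonymous or merely pseudonymous data under the GDPR is a legal question the sketch does not settle.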

For users of Digger 0.4.1 in GitHub Actions, a current workaround to disable usage reporting is to add this step in the action:

...
    steps:
      - name: Break digger analytics
        run: |
          echo "127.0.0.1 analytics.digger.dev" | sudo tee -a /etc/hosts
      - name: digger run
        uses: diggerhq/digger@v0.4.1
 ...

I should add that I am a small contributor to this repository, and that I genuinely prefer this software over similar solutions such as Atlantis. This issue is not an attempt at starting any drama, but to address incorrect information in the documentation, and related privacy concerns.

@ZIJ
Contributor

ZIJ commented Feb 19, 2024

Hi @vdmkenny thanks for raising the issue as well as #1181. We understand your concern, and this is very much an ongoing discussion internally. I also should probably add that none of us at Digger had experience leading or maintaining significant open-source projects prior to this one; so we kind of learn as we go. Let me share a bit more context.

We raised #1154 because over the last couple months we noticed a significant sustained increase of user activity in Slack and GitHub Issues, without a corresponding change in usage recorded. So many users have turned analytics off. The consequence for us is that we don't know who uses the tool, and how it is used. A second-order consequence is that we only know about a problem when someone raises it on Slack or in GitHub, which in reality happens only for a tiny fraction of all issues. We had a few moments like "ooooh this is why they do this or that" - only after someone reached out to us and explained their needs. I wish more users did that; but in reality very few do. With analytics we could have known instantly, but it took months instead.

We are an early-stage startup, 3 people, building this tool full time. We have a finite amount of dollars in the bank, and the figure becomes smaller every day. To us it is a race against time: either we manage to make it a mature enterprise-grade commercial open source product in the time we have left, or we don't, in which case we'd need to hand the maintenance over to the community and look for other sources of income. Does the community win or lose from the second outcome?

Here's a thought experiment. Imagine there were 2 Diggers in 2 parallel universes. In one Digger collects analytics; in another it doesn't. In the one where it does, a fraction of people is frustrated, perhaps rightfully so. The tool does not meet their needs. But the tool also evolves quickly based on the data from people who are willing to share it. In the other universe where Digger does not collect usage data, it evolves much slower - because in reality whenever there's a button "more privacy" most people are going to press it. Now fast forward 2-3 years. Which universe ends up with a better product?

One option we are considering is to bring back anonymisation, similar to what's in #1181 but opt-in. Something like this:

  • ANONYMIZE_USAGE flag set -> analytics is anonymous
  • Not anonymous otherwise

Again this is very much an ongoing discussion. Thanks for your detailed input!

@vdmkenny
Contributor Author

vdmkenny commented Feb 19, 2024

Hi @ZIJ, I certainly appreciate the fact that you're trying to monetise Digger while building a tool that works for all people in all situations.
I introduced Digger to the organisation I work for early on in its development, last summer. I am certainly a believer in the project. Since then, our eyebrows have been raised a couple of times by some decisions, this being the most recent one.

I'd say that actual usage and people not wanting to share telemetry do not grow in a linear way. Early adopters are more likely to turn telemetry on as a show of support, the general populace less so. I don't think the reaction to this should be to force-enable it anyway. In my opinion any telemetry should be opt-in, and anonymisation should be the standard, not an opt-out feature. I don't really see the benefit of having actual usernames, as having a token to group events to understand the context of an error or see a usage pattern should be what you are after.
In my opinion, the problems you should tackle are the ones addressed on GitHub and by your EE users, even if these come from a vocal minority of your users.

Personally, my main concern with this kind of identifying information is that your usage logs could be used to work out which usernames work for which organisation, which makes me very uncomfortable. Especially so if these people are working under NDA or on sensitive projects.

Initially when we set up this project, I was under the incorrect assumption that removing the line collect_usage_data: true from the example digger.yml would disable telemetry. We actually only recently figured out that it was on by default.
I actually just went back to the tag for the initial version we deployed ourselves to look at the old documentation (v0.1.32), and it seems at that time you were actually using the same type of sha256 function to anonymise user IDs as I introduced with my PR. Why the choice to revert this to plain usernames? This is really puzzling to me.

https://github.com/diggerhq/digger/blob/v0.1.32/pkg/usage/usage.go#L37

Again, I understand the need to have this feature, even if I am not a fan of it.

Forcing the "freeloaders" to send telemetry because they're not paying for EE, less so. In my opinion, basic privacy to use a tool should not be kept behind a paywall.

Focusing on the technical aspect, however, I would appreciate it if there was a documentation page going into detail about the actual usage of the data, and also the retention. To me, it seems really user-hostile that you need to crawl through the code to find out that "anonymised" actually means "not so much". At least it would add some transparency.

#1181 certainly wasn't meant as a jab at the company policy or as a joke. I do hope these changes get considered, as it is what I would like to see from this project going forward.

@alberttwong

@the-xentropy

Oof. I understand the logic here but telemetry is a literal no-go in sensitive deployments and I can't use a system that fails open and ends up sending telemetry if there's a bug or regression. Whether it's anonymized or not doesn't even enter the equation.

@simonfelding

simonfelding commented Feb 20, 2024

What the hell. This is illegal and it's not an accident either. Reporting the project as the spyware it is - you're intentionally doing this, while well aware it's massively illegal? Wow.

And you also refuse to provide a data processing agreement? Icing on the illegal theft of data cake.

You can report abuse following this link. This behavior should not be tolerated. https://github.com/contact/report-abuse?report=diggerhq

@ZIJ please get me in touch with your Data Protection Officer ASAP. I would like to request a copy of all the data you have from me. I also require it deleted immediately as per the GDPR, which you are required to comply with as you're distributing your software (and stealing data off people) in Europe.

@ninelore

@ZIJ just wanna emphasize that this is a serious GDPR violation and that you should act immediately if you don't wanna get sued into oblivion, though it might be too late already

@ZIJ
Contributor

ZIJ commented Feb 20, 2024

All - thank you for raising this concern and explaining the nuance in great detail. We are clearly in the wrong here, there’s no way around that. At first we refused to believe it, but asking on HN and Reddit only confirmed what you guys told us in the first place. Lesson learned.

Specifically, we learned that:

  • Not anonymising telemetry is not OK
  • Not allowing users to opt out of telemetry is not OK

The change that caused the rightful frustration has now been reverted in #1184. It reintroduces a flag to disable telemetry (renamed to TELEMETRY), adds anonymisation, and explicit clarifications on telemetry in the docs (in readme, reference and how-to).
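A minimal sketch of what an opt-out gate along these lines could look like, assuming the TELEMETRY flag is read from an environment variable of the same name. The function `telemetryEnabled` and the accepted values are assumptions for illustration, not the code from #1184. Here an unset variable leaves telemetry on (opt-out), while a value that cannot be parsed turns it off, so a typo in the setting never results in data being sent.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// telemetryEnabled reports whether usage data may be sent.
// Assumptions (not confirmed by the thread): the flag lives in the
// TELEMETRY environment variable, unset means enabled, and any value
// that strconv.ParseBool rejects is treated as "disabled".
// The lookup function is injected so the logic is easy to test.
func telemetryEnabled(lookup func(string) (string, bool)) bool {
	raw, ok := lookup("TELEMETRY")
	if !ok {
		return true // opt-out model: on unless explicitly disabled
	}
	enabled, err := strconv.ParseBool(strings.TrimSpace(raw))
	if err != nil {
		return false // ambiguous setting: err on the side of sending nothing
	}
	return enabled
}

func main() {
	if !telemetryEnabled(os.LookupEnv) {
		fmt.Println("telemetry disabled; nothing is sent")
		return
	}
	fmt.Println("telemetry enabled; identifiers should be anonymised before sending")
}
```

Under these assumptions, setting `TELEMETRY=false` (or `0`) in the workflow environment would skip reporting entirely.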

We stopped short of making telemetry opt-in, because in practice no one is going to bother to enable it. Doing so would simply kill Digger the company.

@vdmkenny @Ytrog @soapingtime @BundleOfJoysticks @PaulDotSH @kieranbrown, @simonfelding @wouterw @ninelore @matthiasantierens @the-xentropy @alberttwong @hwcltjn

Thanks again for sharing your feedback and helping us learn.

@ninelore

@ZIJ Can you also please confirm if data collected so far has been deleted or anonymized retrospectively?

@ZIJ
Contributor

ZIJ commented Feb 21, 2024

@ninelore We are working on that and will update here when done

@sikha-root
Contributor

sikha-root commented Feb 21, 2024

Was refraining from commenting when this was opened but since it's attracted the eyes of people who don't even use this...

I maintain a private fork and within hours of 0.4.1's release (so before this issue) I had cut my own release, as part of the cadence of staying up to date with upstream, where telemetry was permanently disabled, in response to the changes that were introduced. (I was previously utilising send_usage_data so didn't bother blackholing the function until then.)
Those changes weren't welcome, but forking is definitely an option, especially if you're self hosting in the first place.

The people at Digger are by no means part of a big company (think about it: would a consultant have their own DPO???), so it would be ideal if certain people cut them slack for anything that resembles Hanlon's razor. No one is perfect, and they're responding in a reasonable manner, I think?

@vdmkenny
Contributor Author

@sikha-root

forking is definitely an option, especially if you're self hosting in the first place.

Technically yes. However, the open source community is fragmented enough already, so to me it seems worth taking concerns about a project upstream first.

The people at Digger are by no means part of a big company (think about it: would a consultant have their own DPO???)

I don't really agree with this comparison. If the consultant had his own company that did data collection, he would 100% have to fulfill DPO duties. If the company he consulted for did, that company would have to. I know GDPR and data privacy can be complex, and many companies are doing it in a "best effort" style, but if you're collecting data, being too small an operation to care about privacy is no excuse at any scale. (Not that this was the response here.)

so it would be ideal if certain people cut them slack for anything that resembles Hanlon's razor. No one is perfect, and they're responding in a reasonable manner, I think?

Some of the responses here and on Reddit were a bit dramatic and in bad faith, I agree.
I'm positive about the statement from @ZIJ, and to me it seems like there will be no need for a public fork in the near future.

I'm eagerly waiting for a new release and related communications.

@UtpalJayNadiger
Contributor

Data has now been anonymised retrospectively. The changes made in v0.4.1 were also reverted in v0.4.2.

Thanks again for helping us make the right decision. This issue will now be closed.
