New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding machine ids to telemetry #3494

Closed
wants to merge 8 commits into
base: rel/1.0.0
from

Conversation

Projects
None yet
10 participants
@adamgorMSFT
Copy link
Contributor

adamgorMSFT commented Jun 10, 2016

Currently collected telemetry has no ability to accurately differentiate users or machines.

Adding a hashed machine ID to collected telemetry to improve ability to differentiate distinct machines(users)

@@ -54,7 +54,8 @@
"Microsoft.Win32.Registry": {
"version": "4.0.0-rc3-24201-00",
"exclude": "compile"
}
},
"System.Net.NetworkInformation": "4.1.0-rc2-*"

This comment has been minimized.

@eerhardt

eerhardt Jun 10, 2016

Member

We do not want * versions in our project.json files. Especial "rc2" versions. Please use the current CoreFX version.

This comment has been minimized.

@adamgorMSFT

adamgorMSFT Jun 10, 2016

Contributor

changed.

PhysicalAddress address = adapter.GetPhysicalAddress();
byte[] bytes = address.GetAddressBytes();
macs.Add(string.Join("-", bytes.Select(x => x.ToString("X2"))));
if (macs.Count() >= 10)

This comment has been minimized.

@eerhardt

eerhardt Jun 10, 2016

Member

Shouldn't use .Count() here - List has a .Count property, which is much faster.

This comment has been minimized.

@adamgorMSFT

adamgorMSFT Jun 10, 2016

Contributor

changed


PhysicalAddress address = adapter.GetPhysicalAddress();
byte[] bytes = address.GetAddressBytes();
macs.Add(string.Join("-", bytes.Select(x => x.ToString("X2"))));

This comment has been minimized.

@eerhardt

eerhardt Jun 10, 2016

Member

Can't we just use UTF8 or some other encoding to turn these bytes into a string? It would probably be much faster.

This comment has been minimized.

@eerhardt

eerhardt Jun 10, 2016

Member

Also, looking below, we are just turning byte[]s into strings here. Then later, turning those strings back into byte[]s. Why not just return List<byte[]> from here, and skip the back and forth?

This comment has been minimized.

@adamgorMSFT

adamgorMSFT Jun 10, 2016

Contributor

The reason they are getting turned into a string and back again, is an attempt to match and hash a specific format (a string in form of FF-FF-FF-FF-FF-FF) in as most similar way that VsCode hashes their mac address.

Essentially I am trying to duplicate the following
VsCode hashing
e.g. result = crypto.createHash('sha256').update(macAddress, 'utf8').digest('hex');
Which gets the macAddress from a string regex match in the above form via getmac or ifconfig

If there is a way I can do this in a better way without the back and forth. I'd love to improve it, I am just not sure how to do this any other way.

This comment has been minimized.

@eerhardt

eerhardt Jun 10, 2016

Member

Why would we care if the string is in a specific format? It is hashed, so it should be 100% opaque to us.

IMO - it should look like this:

private List<byte[]> GetMacs()
{
    ... same as above only add address.GetAddressBytes() to the 'macs' list
}

       private List<string> HashSha256(List<byte[]> values)
        {
            var sha256 = SHA256.Create();
            var hashedStrings = new List<string>();
            foreach (var value in values)
            {
                byte[] hash = sha256.ComputeHash(value);
                StringBuilder hashString = new StringBuilder();
                foreach (byte x in hash)
                {
                    hashString.AppendFormat("{0:x2}", x);
                }
                hashedStrings.Add(hashString.ToString());
            }
            return hashedStrings;
        }

This comment has been minimized.

@adamgorMSFT

adamgorMSFT Jun 10, 2016

Contributor

I have added a comment in the code to indicate the need for a specific format.

@PinpointTownes

This comment has been minimized.

Copy link

PinpointTownes commented Jun 12, 2016

Am I really the only guy having a problem with all these crazy telemetry-related PRs?

Just like command line arguments, sending machine/environment-specific things like MAC addresses sucks really hard from a privacy perspective, even if they are hashed.

If you really want to be able to correlate collected traces, why not simply generating a unique identifier when installing .NET CLI and storing it somewhere in the user profile? Not perfect, but still much better than using MAC addresses for that.

@guardrex

This comment has been minimized.

Copy link
Contributor

guardrex commented Jun 12, 2016

@PinpointTownes You're not the only one upset about these developments. It doesn't bother me that the program exists and provides MS with basic usage information, but it does bother me when sensitive, machine/location-identifiable information is sent to MS and the program is still being operated covertly.

The combination of ...

  1. Automatic activation of telemetry with no notice to the dev/sysop, and
  2. No explicit notice in the installer (or option to set the opt-out env var on install)

... makes the way that the program is being managed dangerous to Microsoft's objectives.

@benaadams

This comment has been minimized.

Copy link
Contributor

benaadams commented Jun 12, 2016

Just store an reuse a guid? So its specific but more anonymous

@Rutix

This comment has been minimized.

Copy link

Rutix commented Jun 12, 2016

I have to agree with the ones before me. If the only purpose really is only to correlate data, storing a guid and using that would be way more privacy friendly than using the MAC address.

@attilah

This comment has been minimized.

Copy link

attilah commented Jun 12, 2016

It should be OPT_IN and not OPT_OUT. See the case of the Visual C++ team.

https://www.reddit.com/r/cpp/comments/4ibauu/visual_studio_adding_telemetry_function_calls_to/

Usually people takes telemetry as spying on them.

@Yantrio

This comment has been minimized.

Copy link

Yantrio commented Jun 12, 2016

Why does microsoft need to identify distinct machines ? Is there any documentation or discussion we can see to find out how and why this conclusion was drawn?

@benaadams

This comment has been minimized.

Copy link
Contributor

benaadams commented Jun 12, 2016

@Yantrio I'd guess because ip address isn't a very good measure as you will get large activation clumping behind NATs/gateways/proxies. Much like you use a cookie in a browser rather than ip address to derive any sensible website per user usage stats.

@Yantrio

This comment has been minimized.

Copy link

Yantrio commented Jun 13, 2016

My question isn't so much as to why it's required to use mac addresses vs IP, it's more as to why at all? is it important to know what a specific user is doing?

@adamgorMSFT adamgorMSFT changed the title Adamgor/telemetry machine ids telemetry machine ids Jun 13, 2016

@adamgorMSFT adamgorMSFT changed the title telemetry machine ids add machine ids to telemetry Jun 13, 2016

@adamgorMSFT adamgorMSFT changed the title add machine ids to telemetry Adding machine ids to telemetry Jun 13, 2016

@adamgorMSFT

This comment has been minimized.

Copy link
Contributor

adamgorMSFT commented Jun 13, 2016

@Yantrio, as already said in the other discussions linked below, "We're only interested in aggregate data that we can use to identify trends". Aggregated data can be quite useful for many different reasons. For example, it can help engineers prioritize features that will make the product better.

Also, there are several other discussions and documentation on this and other topics mentioned.
#2145, #3093, #3404
https://blogs.msdn.microsoft.com/dotnet/2016/05/16/announcing-net-core-rc2/#telemetry

@PinpointTownes

This comment has been minimized.

Copy link

PinpointTownes commented Jun 17, 2016

@Yantrio, as already said in the other discussions linked below, "We're only interested in aggregate data that we can use to identify trends". Aggregated data can be quite useful for many different reasons.

That doesn't explain why a random identifier wouldn't work in this case.

For example, it can help engineers prioritize features that will make the product better.

You have GitHub, Uservoice and a bunch of other channels to hear what your community needs or wants 😄

@benaadams

This comment has been minimized.

Copy link
Contributor

benaadams commented Jun 17, 2016

You have GitHub, Uservoice and a bunch of other channels to hear what your community needs or wants

That doesn't capture what people use most, only what a louder subset talks about most :)

@adamgorMSFT

This comment has been minimized.

Copy link
Contributor

adamgorMSFT commented Jun 17, 2016

@PinpointTownes, Github and other community/user feedback sources are indeed leveraged quite a lot already. Though they typically convey a different set of information. It is more informative and accurate to have a complete picture from multiple sources, instead of just 1. Having just 1 source can sometimes paint an inaccurate and biased scenario.

You have GitHub, Uservoice and a bunch of other channels to hear what your community needs or wants 😄

Proof that the community feedback already influences decisions; you mentioned previously about "command line arguments". They were ultimately dropped from that earlier pull request. The decision to drop it was largely influenced because of community response.

That doesn't explain why a random identifier wouldn't work in this case.

Though it would for the core usages, it won't for all. But there are other scenarios it doesn't and that is a contributing factor why its done this specific way. For 1 example, there is interest in VsCode/CLI correlation, as they are both popular to be used together. So if we want to correlate against that already collected data, that would be 1 reason to collect a comparable value.

@richlander

This comment has been minimized.

Copy link
Member

richlander commented Jun 30, 2016

Thanks everyone for the feedback. I'm closing this PR now. I'll tell you why.

A little bit of context first. We added telemetry to the .NET Core Tools in RC2. I appreciated the feedback that folks made on that, which helped us to create a better product. #1 focus on the telemetry front is sharing usage data with you. We will not even consider making any more changes until that job is done. That's a promise.

Now, to this change. You can see that the change came in 20 days ago. This was during the height of shipping .NET Core 1.0. As a result, the PM team was super focussed on shipping 1.0 and didn't talk to the folks working on this PR. This is a poor excuse. I know you expect more from us.

With 1.0 out the door, we're going to take a closer look at our telemetry plan, both in terms of the actual telemetry and community engagement. We need telemetry for this product, however, "community engagement by PR" is not working. We need to adopt a different model for engagement. This is your .NET, and that needs to be more obvious from our engagement.

I ask for your continued engagement on this topic and to extend your patience a little bit longer.

Thanks everyone.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment