Request for deletion #32

Closed
steipete opened this Issue Feb 18, 2016 · 80 comments

steipete commented Feb 18, 2016

Hi,

I'm getting a ton of emails for surveys/research around software, and it's really annoying. I recognize the scientific effort and initially helped many people here; however, after so many requests, I feel that I did my part and would like to opt out. Can you remove all of my data/emails from this giant data blob?

Thank you.

futuretap commented Feb 18, 2016

Please remove my data, too.

Contributor

slang800 commented Feb 26, 2016

👎 - There are a few reasons why this is a bad idea:

  • Removal from the GHTorrent dataset won't get rid of your data. Both of your email addresses are still publicly available on GitHub. @steipete has his in his profile (see below), and while @futuretap doesn't, it's still in every single commit (here's the first commit I could find, for example).

    [screenshot]

  • GHTorrent is ~6TB of data right now, which would all need to be processed and re-released. Not only is that kind of processing impractical for ad-hoc requests, but everyone who has already downloaded the datasets would need to be found (somehow), contacted, and asked to redownload the cleaned versions.

  • Removing people would introduce errors into the data-set - like issues or PRs created by non-existent people. These kinds of errors would likely prompt researchers and archivists to explicitly seek out unaltered data through other means (see next point).

  • GHTorrent isn't the only dataset, it's just a centralized copy. Anyone can run their own copy of GHTorrent and make their own dataset. In fact, it's not even the only project - githubarchive probably has a copy of your data too and I'm sure that archive.org has a copy somewhere.

So, I think you'd have much better luck just switching to a throwaway email for your GitHub accounts & git commits. Also, it might be a good idea to establish a list of people who do not want to participate in GitHub-related research projects. Complying with such a list would obviously be voluntary, but I don't think that many researchers want to annoy people. So, if ghtorrent.org added a note about the existence of a list to important locations like the downloads page, I bet it would be more effective at solving this than data-removal.

Author

steipete commented Feb 26, 2016

It's not about removing my email address completely - it's all about making it less convenient for people to get a complete set of scraped data that they can use to send everyone spam.

My email is public and that's fine, but there's a difference between people browsing my profile and deciding to ping me and people sending emails to tens of thousands of addresses because they have one convenient data set and it doesn't cross their minds that other people could be offended by the sheer amount of spam this creates. The problem is that it's not something Gmail can filter automatically since these are legitimate emails - just way too many.

futuretap commented Feb 26, 2016

Yes, it's all about making it easy or not for spammers. Your database just makes it way too easy. So please remove my address. You may well replace it with something like no@address if you care about consistency. Also, I don't care about contacting past downloaders. If they are spammers, they wouldn't comply anyway.

Contributor

gousiosg commented Feb 26, 2016

[Apologies for the late reply, I've just seen this]

@steipete I realize that this may be inconvenient; I am getting quite a few of those emails as well. However, "opting out" is not really realistic; as @slang800 also writes, this is public data that GHTorrent just collects. Google, Bing, GitHub Archive, GitHub's own search and various other sites that link to GitHub do the same, and you can readily query them for developer profiles.

Personally, as a researcher, I think that having access to such a vast pool of interconnected data is amazing. As an individual, I don't much like that other people are building a profile of me by judging my OSS activity. However, I "opted-in" when I shared my email and agreed to GitHub's terms, so I should have been more careful.

I like @slang800's suggestion and therefore I will create a page at GHTorrent (top-level) so that people who do not agree with being contacted about research can be listed there. It is however up to the researchers to comply. As far as I understand research, most researchers will be happy to comply.

Contributor

gousiosg commented Feb 26, 2016

@futuretap I would hardly call researchers contacting developers for input "spammers".

futuretap commented Feb 26, 2016

Honestly, that opt-out page is an even worse idea. Spammers will love this list even more, so it exposes my address even more.

The difference to Google and Bing is you can't ask those search engines: "Give me a list of thousands of email addresses to spam to". So I stand by my opinion that exposing email addresses like this is a terrible idea. Why don't you remove just the email column? Interested researchers could still find out the addresses manually but you wouldn't create a spammer's heaven database.

futuretap commented Feb 26, 2016

@gousiosg how can you be sure that only researchers use the data? In fact I'm pretty sure, most abuse is not being done by researchers.

Contributor

gousiosg commented Feb 26, 2016

@futuretap I cannot remove the email column because it identifies devs that do not have a GitHub account and links them with their commits.

WRT the white list, we can have the white listed people's email replaced by something like 'no-spam@ghtorrent.org'. I do need to have a whitelist so that GHTorrent actually consults it when it is processing updates.
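The replacement step described above could look roughly like the sketch below. This is only an illustration and not GHTorrent's actual code: the record layout (a dict with `login` and `email` keys), the `scrub_email` helper, and the example addresses are all assumptions; only the placeholder address comes from the comment.

```python
# Placeholder address quoted in the comment above.
PLACEHOLDER_EMAIL = "no-spam@ghtorrent.org"


def normalize(email):
    """Lower-case and trim an address so lookups ignore case and whitespace."""
    return email.strip().lower()


def scrub_email(record, opted_out):
    """Return a copy of `record` with the email masked if its owner opted out.

    `record` is a hypothetical dict with "login" and "email" keys;
    `opted_out` is a set of normalized addresses from the opt-out list.
    """
    email = record.get("email")
    if email is not None and normalize(email) in opted_out:
        return {**record, "email": PLACEHOLDER_EMAIL}
    return record


# Hypothetical opt-out list and incoming updates.
opted_out = {normalize("Jane@Example.org")}
updates = [
    {"login": "jane", "email": "jane@example.org"},
    {"login": "sam", "email": "sam@example.org"},
]
cleaned = [scrub_email(record, opted_out) for record in updates]
```

Keying the list on logins or on hashed addresses instead of plain addresses would avoid publishing the very emails people asked to have removed.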

Contributor

gousiosg commented Feb 26, 2016

@futuretap WRT your comment on search engine usability

[screenshot, 2016-02-26]

curl 'https://api.github.com/search/users?q=language%3AObjective-C%20followers%3A%3E%3D150&order=asc&sort=followers&per_page=100&page=1'|grep futuretap

futuretap commented Feb 26, 2016

I'm a bit confused. You're calling this opt-out list a "white list"? I'd call it a black list.

Anyway, the point is, when opting out, I'd like to be exposed less not more with my email address.

So do I understand correctly that when opting out, you'll replace the email in the database with something like 'no-spam@ghtorrent.org'? Then I'm all in!

WRT your search engine comment, I don't understand your message here. This search thankfully doesn't include email addresses. But your database does.

Contributor

slang800 commented Feb 26, 2016

The difference to Google and Bing is you can't ask those search engines: "Give me a list of thousands of email addresses to spam to". - @futuretap

Au contraire, mon ami - that's exactly how spammers build their initial lists. Check out http://www.binarytides.com/email-harvesting-metasploit/ for a tutorial.

futuretap commented Feb 26, 2016

I think it's moot to discuss spammer techniques here. I still think it's reasonable to ask you to remove my address from your database.

Contributor

gousiosg commented Feb 26, 2016

@futuretap Ok, for the white/black list, I am glad you agree!

WRT GitHub's search, let me elaborate a bit:

curl 'https://api.github.com/search/users?q=language%3AObjective-C%20followers%3A%3E%3D150&order=asc&sort=followers&per_page=100&page=1'|
grep '"url": '|sed -e 's/"url": "\(.*\)",$/\1/'|
while read url; do 
    curl -s $url
done |
grep email

With a little more effort (and a few more API keys) I can collect thousands of emails in a couple of hours. My point is that it is pointless to remove your email from GHTorrent, but since you insist, I have done this already (also for @steipete). You can confirm this here: http://ghtorrent.org/dblite/

select * from users where login='futuretap'

milosonator commented Feb 26, 2016

You can do that, but it will be extremely ineffective: the spammers clearly already have it. Asking to remove it is like asking to remove your phone number from the new phone books when your phone number is already all over the internet, ready to be harvested by anyone who can write some bash, as @gousiosg demonstrates here.

If you want your e-mail to be more private, remove it from GitHub altogether to prevent future spammers from grabbing it.

Contributor

slang800 commented Feb 26, 2016

I'm pretty sure, most abuse is not being done by researchers. - @futuretap

What makes you think that?

When I initially replied I assumed that you were talking about emails from researchers that explicitly say they found your info from GHTorrent. But if you're assuming that spammers have been finding your info from this dataset, then I think you should consider how much easier it is to just scrape GitHub itself and avoid the huge download... Or maybe just have Google scrape it for you, since it seems like they cache the email address too:

[screenshot]

I think it's moot to discuss spammer techniques here. - @futuretap

Perhaps, but my point is that GHTorrent isn't unique in this problem... It sucks if spammers are using this dataset, but there isn't a decent solution... not even search engines have found a good way to protect emails from spammer harvesting.

samdmarshall commented Feb 26, 2016

I second that @futuretap, there should be a way to opt out of this service. The amount of "oh I found you on GitHub" mail I've been getting lately is incredibly annoying.

Contributor

gousiosg commented Feb 26, 2016

Researchers have been contacting devs since before GHTorrent; in fact half of the research papers being published mention GitHub Archive as a source of dev data. As I demonstrate above, it is easier to just scrape GitHub's search than to download and restore (takes 2-3 days!) GHTorrent's database to collect emails (not to mention the /users API endpoint).

It is 2016; if you expose your email on the Internet, you are bound to get emails (spam or legitimate ones).

Author

steipete commented Feb 26, 2016

It's really not classical spammers but students and researchers who are completely oblivious to the problem and consider the existence of this project as legitimacy to email=spam everyone.

The last time I tried to explain the issue, the researcher suggested sending an email to everyone, asking if they are getting too many emails from this source. By then I just wanted to scream.

futuretap commented Feb 26, 2016

Thanks @gousiosg for the removal. If you're serious about fighting spam, you might want to reconsider publishing the harvesting script here. I get your point that it's easily doable also without you publishing the detailed spammer harvesting recipe.

Contributor

gousiosg commented Feb 26, 2016

@futuretap If I can do it in 5 mins, one has to assume that any developer can do it. Security (or privacy) through obscurity does not work.

futuretap commented Feb 26, 2016

As @steipete said, it's more about dumb students and researchers than professional spammers.

Author

steipete commented Feb 26, 2016

I opened this issue with the hope that it might cause a change in mindset. Just because it's easy to ring on all doors on the street doesn't mean you should do it.

Contributor

slang800 commented Feb 26, 2016

it's more about dumb students and researchers than professional spammers

Then shouldn't we try to educate them, rather than punish everyone with an inferior/damaged dataset?

futuretap commented Feb 26, 2016

Absolutely. However that education shouldn't include spam email harvesting techniques.

Author

steipete commented Feb 26, 2016

Yes. You could educate them best by force-pushing this repo to an empty one with a readme explaining the problem. Otherwise I doubt people will read.

Contributor

gousiosg commented Feb 26, 2016

Oh right, because censorship always solves problems with 'dumb' people using information, doesn't it?

madrobby commented Feb 26, 2016

@gousiosg Please remove me as well. I've never agreed to share my data on your website. Thank you.

Author

steipete commented Feb 26, 2016

How about adding a disclaimer on the start page of http://ghtorrent.org explaining the issue and asking people to not send unsolicited mass-email? Currently it mentions nothing about this highly relevant topic.

Some people might believe it's common sense to not send email to a huge list without an explicit opt-in, but apparently this isn't the case if one can justify it under the research-label.

mallamanis commented Feb 26, 2016

GHTorrent is a valuable resource for software engineering research and I disagree with any idea of censoring it.

People who may be 'dumb' enough to send mass emails are not necessarily so "technically dumb" that they cannot retrieve emails in another way. Removing any data from GHTorrent will not change the way some people (researchers or not) behave and I feel that it is not GHTorrent's problem to solve. Developing in public implies that all the commits are visible to everyone and by definition there is no privacy in that. If your email has gone public, it is public (and this is your problem). [just to note here: this is the case with my email too. I don't like it, but I accept it]

Philosophical question: @steipete How do you even know that the unsolicited emails you have received from researchers/students were extracted from GHTorrent? For a long time I was using GitHub's own dump in Google BigQuery with GitHub Archive that also contains this data and it should be equally easy to get an email dump from there. I have also received multiple emails because of people that have scraped app store emails. Removing the data from GHTorrent definitely will not solve the problem with email etiquette.

Some suggestions:

  • (as @steipete suggested) A disclaimer: the emails in the dataset are not a mailing list. Researchers should not use the dataset as a mailing list. Individuals cannot unsubscribe from it (because it's not a mailing list).
  • A FAQ: People need to understand that once they use their email publicly (e.g. in a git commit or in GitHub elsewhere) their email might, may and will be found by anyone. GHTorrent or anyone else cannot stop that. The FAQ could explain how to avoid getting your email to appear in public when using git/GitHub.

Contributor

slang800 commented Feb 26, 2016

@futuretap - what is that difference? Would you be guilty of mass-distribution just for mirroring or archiving GitHub? Or does the problem arise when you provide data in an easy-to-access format? Could you get banned just for posting the list of emails in the git log?

samsonjs commented Feb 26, 2016

@mallamanis @slang800 Email sent to my address comes to my inbox. It's 100% personal. You let me worry about the other places I choose to distribute it. I did not choose to distribute it in this massive list.

Contributor

slang800 commented Feb 26, 2016

@SpacyRicochet - GHTorrent isn't a researcher, doesn't control any researchers, and isn't sending anyone emails. It's a data-set. I think that we all agree that spamming is bad. The debate that we're having is about whether or not censoring GHTorrent is ethical and/or useful.

I'm actually not familiar with Dutch legal code, are you talking about the "right to be forgotten", or something else?

neonichu commented Feb 26, 2016

Let's also be honest here, the main contributor of GHTorrent is a researcher and is admitting to spamming people using GHTorrent data on their blog: http://www.gousios.gr/blog/Scaling-qualitative-research/

futuretap commented Feb 26, 2016

Interesting. So it looks like @gousiosg is actually violating the @github ToS. If not by distributing the database then by sending those 3,500 emails as he openly admits.

Contributor

slang800 commented Feb 26, 2016

@samsonjs / @SpacyRicochet: You both did distribute your email addresses. I truly don't understand how you can publish this information on the internet, and then be upset when someone makes it too easy to access.

If I argued that my quote saying "spammers don't care about the ToS" was private information and @SpacyRicochet was infringing upon my rights & redistributing it without my consent by reproducing it in his comment, would you not think I was insane?

segiddins commented Feb 26, 2016

My email needs to be public, for the same reasons Ash pointed out -- largely, people need to be able to contact me with CoC-related issues for the projects that I run. I don't mind my email being public, but I do mind it being aggregated in this way, and am frankly quite bothered by the amount of spam I receive, and the fact that you refuse to remove my personal data from a datastore you control and distribute publicly.

samsonjs commented Feb 26, 2016

@slang800 I didn't consent to or personally distribute my email in this giant torrent. I disclose it on GitHub. If a spammer scrapes GitHub and emails me then that's a separate issue that has no bearing on this discussion. You don't get to tell me that because I distribute my info in one way then others suddenly have the right to take my info and distribute it in other ways.

samsonjs commented Feb 26, 2016

@slang800 How about I go to your website and copy posts from your blog? I mean, you published it on the Internet so now it's fair game for anyone to use for any purpose they wish, right? Give me a break.

Contributor

slang800 commented Feb 26, 2016

@samsonjs: if they're CC0'd, then feel free to copy verbatim. Otherwise just give attribution and you're good.

Edit: Your email address probably doesn't count as a copyrighted work in the same way that a blog post could, but I'll follow along with the analogy nevertheless.

mallamanis commented Feb 26, 2016

I believe that you are using GHTorrent as a scapegoat for the spam you are receiving from researchers who do not do research "ethically". We all agree that spam is annoying. We shouldn't be discussing this here. Instead we should be discussing whether removing the emails from this dataset (which contains information that is already publicly available to everyone with an Internet connection) will stop those emails from coming (which are indeed very annoying).

As @segiddins and @ashfurrow say, the email address is there to solicit emails from a specific subset of people. Unfortunately, no one can enforce a subset on this solicitation with GitHub's current structure. It's not GHTorrent's fault. @github should allow us to set specific permissions to allow disclosing our email (e.g.) only to contributors.

Regarding your previous post: @samsonjs The emails that arrive at your inbox are 100% personal and private. No one else can/should read them or even know that you have received any email or know about any of its metadata. But an email address is not private information once publicly disclosed. (We both chose to publicly disclose this information at some point). I hate getting spam (from researchers or not) but I have to live with it, as I have to live with the unsolicited physical ad brochures I receive at my office. Both these instances of spam do not violate anyone's privacy (but are very annoying).

So, if you buy my argument above, please, let's stop talking about privacy and focus on spamming, and specifically on whether removing this information from here will reduce our spam influx (I believe this isn't the case).

Enjoy your weekend :)

lazerwalker commented Feb 26, 2016

This is more a case of respect than literal privacy. We all recognize and understand this information is already public on the Internet, and you can technically do what you want with it. Many of us believe that having our email addresses and other data removed from your dataset would decrease the amount of spam. Maybe we're right, maybe we're wrong; what's more important is that you refusing to do this shows an incredible lack of respect and empathy for the people whose data you're scraping.

The Wayback Machine is another high-profile example of publicly-available information being captured and exposed in a way that makes it more accessible. They let you opt-out. They very clearly tell you exactly how to have your site excluded, either by modifying your robots.txt file or just by emailing them (https://archive.org/about/faqs.php#2). Notably, they retroactively apply this: if you block the IA in your robots.txt file today, any earlier versions of your site that previously appeared on the Wayback Machine will be hidden going forward.

Even though their goal is to archive the entire Internet, they recognize that respecting their implicit contributors enough to give them the choice to opt-out is more valuable than having an exhaustive data set or simplifying the technical cost of maintenance.

alloy commented Feb 26, 2016

I would like to +1 Mike's first paragraph, which is exactly how I feel about it. As such, please add me to the list of people not to contact.

ELLIOTTCABLE commented Feb 26, 2016

So, I usually don't want to contribute to a pile-on like this, but this is so crucial:

I will not hash emails because i) it is extra effort for minimal (if any) benefit and ii) will deteriorate the quality of research being done with GHTorrent. People are using emails to link GitHub profiles to StackOverflow, OSS project mailing lists and Jira databases and perhaps other data sources as well. You can have a look at the proceedings of the MSR conference to get an idea what emails are used for.

No. Stop. You don't get to say how much benefit it is to us: your argument that it's “just as easy to get the information elsewhere” isn't fucking enough. We know it's technologically trivial, that's not the point; that's like arguing that “It's okay that I'm forging hundreds of signatures, because it's really easy to forge a signature! Anybody could do it!”

Obscuring the e-mail address in the database (preferably, via a simple hash) allows the researchers to do the same matching of username-to-commit that is currently done via e-mail address [which is a valid goal]; and it may not prevent unkindly-minded researchers from Doing Bad Things, but, (and let me put this in large type, because it's really important,) …

… doing so sends an important message.

It says to the (possibly not-so-concerned-with-us-developers'-time-and-energy) researchers using your database, “hey, if I need to take an extra step to scrape all these e-mail addresses … huh, this, kinda feels like the kind of thing that I hear about spammers doing … wait, does this make me a spammer?” (Hint, researchers reading this: yes. yes it does. You are a spammer. Please stop.)
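As a rough illustration of the hashing idea above (a sketch under assumptions, not GHTorrent's schema or code): replacing each address with a stable digest still lets two records that share an address be linked, without storing the address itself. The record shapes and the `pseudonymize` name below are hypothetical.

```python
import hashlib


def pseudonymize(email):
    """Map an email address to a stable SHA-256 hex digest."""
    normalized = email.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Hypothetical records from two different sources that share an author.
commit = {"sha": "abc123", "author_email": "Dev@Example.org"}
profile = {"login": "dev", "email": "dev@example.org"}

# Equal digests mean the commit and the profile can still be linked.
same_person = pseudonymize(commit["author_email"]) == pseudonymize(profile["email"])
```

A plain, unsalted hash like this can be reversed for any address an attacker already knows or can guess, so it raises the effort needed rather than guaranteeing anonymity. A keyed hash (for example via Python's `hmac` module) would be harder to reverse, at the cost of requiring the same key wherever records need to be linked.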


And, more importantly than refuting your “there's no point to you doing this” argument, there's this one:

This is our data. Not yours.

From your point of view, we may have posted this publicly, in one forum (on a social-networking site, for the use of other developers wishing to contact us about the software work we're explicitly doing with those other developers), but that does not imply an opt-in to such research efforts.

This sort of invasion of privacy? It's why IRBs exist; it's why research subjects are guaranteed full-disclosure and informed consent is sought.


I seriously cannot believe you're arguing about this, if you call yourself a researcher. Where are your scientific ethics? This is a huge privacy, disclosure, and informed consent issue. I'm sorry if this post seems adversarial, but that's because you've hurt me, and you have no right to. Get your head on straight, and remember the ethical foundations of your field as a researcher, please.

davidschreiber commented Feb 26, 2016

+1 @ELLIOTTCABLE

Contributor

gousiosg commented Feb 26, 2016

Let me clarify a few things:

  • I have never refused removing individuals' emails from the GHTorrent database. If you bothered to read the whole thread you would see that I have already done so. I also proposed a scheme where individuals would be able to have their emails removed (@slang800 has already started working on it).
  • If something is on the Internet, you either need to provide a license with it or it is in the public domain. This is why licenses such as Creative Commons were invented.
  • To the best of my knowledge, I have not used the GitHub service to "upload, post, host, or transmit unsolicited email" for my research.
  • I do not allow anyone to question my scientific integrity, scientific ethics or my research work on grounds other than scientific merit. My research work is fully disclosed on this very site and you are welcome to inspect it. As my 1,500 respondents can confirm, I've treated their privacy with the utmost respect.

segiddins commented Feb 26, 2016

If something is on the Internet, you either need to provide a license with it or it is in the public domain. This is why licenses such as Creative Commons were invented.

False. At least in the USA, full copyright is the default. No license may be presumed in the absence of a grant thereof, and public domain is a license.

I have never refused removing individuals' emails from the GHTorrent database

In that case, please remove mine as well.

Contributor

gousiosg commented Feb 26, 2016

@segiddins I did not discuss copyright, which depending on the jurisdiction may or may not, belong to you. I discussed licensing. If something is public domain, then derivative work is allowed by default, at least according to my understanding of IP law (IANAL).

BTW, I removed you and all other people on this list.

davidcelis commented Feb 26, 2016

Please remove me as well. Thanks.

chucker commented Feb 26, 2016

I did not discuss copyright, which depending on the jurisdiction may or may not, belong to you. I discussed licensing.

Licensing is only relevant because of copyright. If a work isn't copyrighted, it doesn't need to be licensed, as it's effectively in the public domain. These claims of yours:

If something is on the Internet, you either need to provide a license with it or it is in the public domain. This is why licenses such as Creative Commons were invented.

…are way off, and dangerous for a project that so significantly hinges on third parties' rights — privacy, property or otherwise.

cainlevy commented Feb 26, 2016

There's a world of difference between technically available information and readily available information. The lock on my front door is technically not secure, but in practice it still keeps out the majority of undesired visitors.

This data set makes it easy to infer private personal details that individuals would prefer remain difficult to discover. Please anonymize all research data sets. I would suggest dropping most of the users table, except for maybe country_code. Yes, it reduces the potential uses for the data set but that is actually the point.
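The suggestion above amounts to an allow-list over the columns of the users table. A minimal sketch, assuming a row is available as a dict: only `country_code` is named in the comment, while the `id` key and the other column names are hypothetical.

```python
# Columns allowed to survive anonymization; "id" is an assumed key column.
KEPT_COLUMNS = ("id", "country_code")


def anonymize_user(row):
    """Keep only the allow-listed columns of a users row."""
    return {col: row[col] for col in KEPT_COLUMNS if col in row}


# Hypothetical row; the real table layout may differ.
row = {"id": 42, "login": "dev", "email": "dev@example.org", "country_code": "nl"}
anonymized = anonymize_user(row)
```

An allow-list fails safe: a column added to the table later is dropped unless someone explicitly decides to keep it.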

Meanwhile, please remove me.

rastersize commented Feb 27, 2016

Please remove any data about me and block my personal data from being included again. Also make sure such blocking is done in a privacy friendly manner.

Why? Really, because it’s the right thing to do morally and ethically: respecting me, my personal data and how I want it to be handled. However, if that’s not enough, or you don’t agree, I’ll ask you to read the section titled “So why should you comply with my wish of removal?” which details my understanding of the legal aspects.

I didn’t know about this data collection and processing until I was alerted via a random tweet and looked myself up, finding out that my personal data is being collected and processed. You do not have my consent to collect and process my personal data. Not cool.

So why should you comply with my wish of removal?

It boils down to personal data protection laws in EU as defined by Directive 95/46/EC of the European Parliament and of the Council. The Netherlands (which GHTorrent is based in according to your public information) has ratified this directive and passed it as “Wet bescherming persoonsgegevens” (Personal Data Protection Act) which went into effect on January 1st, 2001.

I, on the other hand, am a citizen of a member country of the EU.

You failed to get consent

My personal data should never have been included in the data set, as you never asked for my consent. Article 7 of the mentioned directive makes this clear:

CRITERIA FOR MAKING DATA PROCESSING LEGITIMATE

Article 7

Member States shall provide that personal data may be processed only if:

(a) the data subject has unambiguously given his consent; or
(b) processing is necessary for the performance of a contract to which the data subject is party or in order to take steps at the request of the data subject prior to entering into a contract; or
(c) processing is necessary for compliance with a legal obligation to which the controller is subject; or
(d) processing is necessary in order to protect the vital interests of the data subject; or
(e) processing is necessary for the performance of a task carried out in the public interest or in the exercise of official authority vested in the controller or in a third party to whom the data are disclosed; or
(f) processing is necessary for the purposes of the legitimate interests pursued by the controller or by the third party or parties to whom the data are disclosed, except where such interests are overridden by the interests for fundamental rights and freedoms of the data subject which require protection under Article 1 (1).

None of these tests holds true in my case, at least. I’d guess they don’t hold true in many other cases either, though I can’t know that for certain; maybe you’ve collected consent (a) from everyone else?

You’re required to remove the data by law

As you failed to make your collection and processing legitimate as per Article 7, you’re required by law to remove the data; see Article 12 of EU Directive 95/46/EC.

Article 12

Right of access

[…]

(b) as appropriate the rectification, erasure or blocking of data the processing of which does not comply with the provisions of this Directive, in particular because of the incomplete or inaccurate nature of the data;
(c) notification to third parties to whom the data have been disclosed of any rectification, erasure or blocking carried out in compliance with (b), unless this proves impossible or involves a disproportionate effort.

This part is often referred to as the right to be forgotten in the EU.

Article 14 of the directive also gives the subject (me in this case) the right to object at any time.

This is at least my understanding of Directive 95/46/EC; however, I’m not a lawyer. My understanding is also that these protections are going to become even stronger.


tl;dr All in all, you’re violating my rights as a citizen of the EU. As such, please delete any personal data about me, block it from being added again and, if possible, notify all third parties that they may not use the data about me. These requests are backed by Directive 95/46/EC and the Wet bescherming persoonsgegevens.

@cattedoctor

cattedoctor commented Feb 27, 2016

The most interesting thing about this is how your behavior and your project are in violation of ACM & IEEE ethical guidelines. Please consider treading more wisely or those who have been harmed may escalate the issue with the ACM & IEEE.

@madrobby

madrobby commented Feb 27, 2016

I've reported the ghtorrent user as abusive to GitHub, with a link to this thread. The project maintainer clearly doesn't care about blatantly stomping on our privacy. I recommend that everyone who wants their email removed does the same.

@ELLIOTTCABLE

ELLIOTTCABLE commented Feb 27, 2016

(Hey. Let's take this down a notch: the maintainer / creator has made their point, we've responded to and refuted it in detail … let's give them a chance to change their mind before pulling out the legal threats, okay? Just my 2¢.)

On Fri, Feb 26, 2016 at 7:49 PM Thomas Fuchs notifications@github.com wrote:

I've reported the ghtorrent user as abusive to GitHub, with a link to this thread. The project maintainer clearly doesn't care about blatantly stomping on our privacy. I recommend that everyone who wants their email removed does the same.


@madrobby

madrobby commented Feb 27, 2016

@ELLIOTTCABLE how is reporting them for abuse a legal threat? what?

@gousiosg

Contributor

gousiosg commented Feb 27, 2016

This got out of hand. The topic of this issue is removing people's emails from the GHTorrent database. I think all opinions, however harsh or abusive, have been heard. This is a short summary of my opinions and actions in response to the issue:

  • GHTorrent does not track or process private information or personal data. It only tracks data in the public domain and makes them available in a different, easier to use format. Whatever GHTorrent does, anybody can do (and has done).
  • I have already complied with all requests for email removals.
  • I have demonstrated that it is pointless to do the above.

The project will create a process to remove email addresses in a more systematic manner. It will also create a FAQ page with information about how to use the data in a more privacy-respecting manner.

IMO, the real solution to the problem is for GitHub to create a button that allows people to opt in to or out of the use of their email for such purposes (similar to the "Available for hire" one). In the meantime, people who do not want their email exposed can keep it private.

To the people that I have removed from GHTorrent: please be aware that you will most probably get more emails from researchers (not because of GHTorrent). The following services can be used to get the same information available on GHTorrent, in exactly the same format and in exactly the same volume:

@gousiosg gousiosg closed this Feb 27, 2016

@ghtorrent ghtorrent locked and limited conversation to collaborators Feb 27, 2016

@gousiosg

Contributor

gousiosg commented Feb 27, 2016

An update: access to GHTorrent data has been suspended until we clear up this issue.
