Severely flawed methodology, wrong results #3

joepie91 · 2017-09-10T10:35:06Z

So, rather than this being a bug on the scripts themselves, this is a bug in the methodology that they implement. According to the article...

Rather than debate whether to measure high-velocity projects via commits, authors, or comments and pull requests, we use a bubble chart to show all 3 axes of data, and plot on a log-log chart to show the data across large scales.

The problem is that this is a completely biased metric. To understand why, let's look at a hypothetical example. Let's pretend for a moment that there are only two ways to make a HTTP request:

Using EveryRequest, which is a library that implements HTTP, FTP, and SCP. Each of those three protocols is maintained by a different author, but within the same codebase; and there is a fourth author that wires it all together in a common interface.
Using HTTPRequest, which is a library that just implements HTTP, and that is maintained by a single author. Similarly, there are FTPRequest and SCPRequest (also each with a single maintainer of its own), but we don't use those because we only need HTTP.

Now, let's say that the userbase looks like this, for each:

EveryRequest: 130 people using it in total; 80 people using it for HTTP, 40 people using it for FTP, 10 people using it for SCP.
HTTPRequest: 100 people using it for HTTP.
FTPRequest: 60 people using it for FTP.
SCPRequest: 20 people using it for SCP.

Now, we get the following data:

Most users: EveryRequest. This is wrong, because for each individual protocol, HTTPRequest/FTPRequest/SCPRequest are more popular, but their userbase is split between multiple projects.
Most commits: EveryRequest. It contains all the commits for each protocol, thereby likely having more commits in total than any of the other three projects, even if it implements the same protocols, with the same degree of maintenance, and so on. Therefore, also wrong.
Most authors: Again, EveryRequest has 4 authors, as opposed to 1 author for each of the other libraries; which, for those other libraries, adds up to 3 since there's nobody who needs to tie the interfaces together. Yet there's the same degree of maintenance (one author per protocol), and thus this result is, again, wrong.
Most issues: Again, EveryRequest comes out 'on top' because it has more total users than any of the other libraries individually, even if per protocol it has both less users and less issues. Which, again, is wrong. An additional factor here is that EveryRequest may be less reliable, and therefore generate more issues than any of the other libraries, simply because its quality is lower.
Most PRs: Once again, EveryRequest comes out on top. It gets PRs for all three of the protocols, while each of the other projects only get PRs for the single protocol they implement. Even if the net amount of contributions to those is higher.

In all of the above data points, EveryRequest comes out on top; in all of them, incorrectly so. The reason this happens, is that the wrong unit of measure ("project") is used; a more accurate measurement would have been by feature. How many authors maintain a given feature? How many people use it? How many contributions are received to it?

As it stands, the metrics greatly favour monolithic projects, which are necessarily going to be corporate projects; it's already well-understood that project structure often mirrors the organizational structure of the organization or environment in which it was developed. This means that corporations are used to developing monolithic internal projects, and have simply extended this practice to their open-source projects (which can indeed be seen from the architecture of many of the listed projects).

On the other hand, individual developers are more likely to build smaller, single-purpose projects that can be integrated with other software, and that are often deployed far, far more widely than these "high-velocity" monolithic projects. How many people really use OpenStack, for example? It's primarily used internally at companies for large infrastructure deployments, and that's also where its contributions come from, because it tries to handle everything in that infrastructure in a single project (even if composed of multiple parts).

(There are quite a few other unaddressed questions here, too. Was the project always corporate-backed, or did that only happen after the bulk of the contributions? How many of the contributions are made by third parties, and how many are made by employees of the corporation running the project? How much does the corporation really contribute?)

In other words: the way you're measuring favours corporate open-source projects, and has therefore already decided the outcome of the research before even starting on it. This is already a problem from a research perspective, but it's made worse by the chilling effect this can have on individual open-source contributions (especially when published on the Linux Foundation site!), by making individual contributors feel like the open-source community is no longer 'theirs'. There are real and serious consequences to this.

I would strongly recommend retracting the article and informing press (eg. TechRepublic) of that, or at the very least adding a clear notice at the top that the research is not reliable. As it stands, it's extremely misleading.

EDIT: From a quick glance at the article, it also seems like this data was based entirely on GitHub projects alone, which introduces further bias. There are many other platforms (including self-hosted!) that are often used for maintaining non-corporate projects.

dankohn · 2017-09-11T01:11:10Z

Thanks for the detailed comments, though I unsurprisingly disagree. The velocity metrics are attempting to measure project velocity; that is how fast a project is moving. They are not trying to measure goodness, or importance, or conciseness of code, or any one of a million other factors you might want to measure. But many developers strongly prefer to use projects that many other people are using and contributing to and velocity is a decent measure of that.

Now, it's fine to consider whether a project like left-pad might score highly in some metric of importance because it is referenced by so many other libraries. But this effort never claimed to measure importance, just velocity, and we should be able to agree that left-pad has a very small number of issues, commits, and authors.

Note that although Asay cites me as his source, I had nothing to do with his article and do not endorse it. However, I do very much stand behind the article I wrote.

joepie91 · 2017-09-11T11:49:59Z

If the article were just about 'velocity' (insofar that is a quantifiable metric), then this wouldn't be a problem. The problem is caused by statements like this:

What differentiates the most successful open source projects? One commonality is that most of them are backed by either one company or a group of companies collaborating together

So, tracking the projects with the highest developer velocity can help illuminate promising areas in which to get involved, and what are likely to be the successful platforms over the next several years.

So, what’s the takeaway? Software development is hard. Running a large open source project is even harder. So, it is often helpful to have backing from an individual company or a consortium of them working through a software foundation.

The structure and funding from a foundation or corporate sponsor provides more confidence that the project will remain active and stable over the long term.

All of these are nonsensical conclusions, based on the data you have. "Success" is not reasonably measured in contributions; it is measured in stability, adoption, level of support, and many other things that do not directly correlate to the absolute amount of contributors or issues/PRs (as I've already explained above).

The claim that "it's helpful to have backing from [a company to run an open-source project]" is also completely unfounded; in no way does the data prove that statement to be the case. The only thing you've proven with your data is that large monolithic projects with lots of commits/authors/issues are often corporately-backed; which, aside from being obvious for the reasons I've already described, simply doesn't translate to any of the other claims you're making.

Stability is also something that doesn't come from a high amount of contributions; quite the opposite, it's something that comes from feature-completeness and relatively few contributions, ie. the exact opposite of what monolithic projects turn into. The ideal project doesn't need to remain 'active' for more than a few months because it is done.

This is like saying "the sky is blue, you can just look up to see that, therefore blue is the most important color" -- all you've proven is that the sky is blue (or at least appears so), but the second part of the statement ("blue is the most important color") is just thrown in there without any backing, even if it seems superficially related. It's the same problem here; your data does not support your conclusions.

And to address one particular point separately:

But many developers strongly prefer to use projects that many other people are using and contributing to and velocity is a decent measure of that.

Not only is this potentially wrong (How many is "many"? Is it statistically significant?), it's also often a misguided approach on the part of the developers who do prefer monolithic "high-velocity" projects; often they have given no consideration to how well particular features are supported, for example, and end up replacing tools or dependencies down the line.

It also has no relevance to real-world concerns like operational/development cost, ease of development, maintainability, security, and so on.

In short: you need to either fix the methodology to support the conclusions you're drawing, or fix the conclusions you're drawing to reflect the data from your current methodology. But as it stands, the conclusions here are nonsensical, and this sounds more like a marketing fluff piece than like legitimate research.

EDIT:

Thanks for the detailed comments, though I unsurprisingly disagree.

That's a very worrying comment. In a research context, one should always be willing to have their research critically assessed; your statement, however, implies that you've already made up your mind and that no amount of criticism is going to change it. That's not how good research is produced.

namliz · 2017-09-11T18:44:16Z

@joepie91 let me congratulate you on the counterexample you provided in your original post. This is called Simpson's paradox and you're entirely not wrong in pointing it out.

I'll further agree with you that the number of commits, authors, comments, pull requests, stars, 'etc - these are in fact all biased metrics in a sense. The bias is towards large and popular projects.

the wrong unit of measure ("project") is used; a more accurate measurement would have been by feature.

At this point things start to make less sense, because that's just, like, your opinion man.
Allow me to make a simple analogy, behold: the Box Office Mojo's top movies of 2017.

Sadly the list is dominated by big budget CGI-ladden corporate backed drivel and the box office returns do not strictly correlate to quality - although let me remind you, quality is subjective, and some of us enjoyed The Emoji Movie and Angular2: Electric Boogaloo quite a bit.

There's no accounting for taste, I agree with you, these people are simply wrong and Lego Batman and React are much better.

Incidentally, do you have any specific and actionable suggestions on how the methodology be changed to account for this?

Measuring how many contributions, maintainers, and users there are on a feature-basis a grand idea. I haven't got the foggiest how you'd collect such data across the entire open source ecosystem. Would be neat.

Furthermore, I think a code quality metric based on stability and elegance of algorithms and impact and what not - kind of like a rotten tomatoes for repositories, with subjective ratings by expert reviewers, is a fantastic idea.

I might actually have to go build that (damn you), but it currently doesn't exist and would measure entirely different things.

To get back to the subject at hand:

In other words: the way you're measuring favours corporate open-source projects, and has therefore already decided the outcome of the research before even starting on it.

it's also often a misguided approach on the part of the developers who do prefer monolithic "high-velocity" projects; often they have given no consideration to how well particular features are supported, for example, and end up replacing tools or dependencies down the line.

Well no, look here. You don't always get to pick a technology on technical merits alone unless you develop in a vacuum.

The robustness of an ecosystem, and dare I say for lack of better term, velocity, surrounding a project is a useful feature of sorts to base your decision on. Not sole feature, but useful and interesting one. For better and worse.

If I had to pick a JS framework for my next project - they are all absolutely dreadful in my opinion - I'd definitely want to get some sense of adoption.

It isn't a coincidence that there are corporate projects at the top of the list (Angular2, React, Polymer, that other one Ebay made that wise asses like to throw into the mix occasionally to mess with your sanity) - definitely all kinds of agendas at play there.

I could run away from Angular2 because I've seen this pattern of adoption/hype before with the corporate train wreck that is Angular1... or towards it if I wanted to pick up essentially a front end COBOL that will ensure me lucrative employment for, shudder, decades to come.

Or, behold, I could point to the one man show of Vue.js giving the big boys a run for their money and convince the stake holders in my company that they'd totally be able to hire Vue.js developers in the future because it has a robust community growing at a high velocity.

Ain't that grand?

People are free to draw their own conclusions from the underlying data, the charts produced by this project are obviously interesting and relevant (even if it doesn't cover your specific curiosities) and are based on the best metrics available at hand.

I'd be very happy to see a better chart from you instead of complaining that popular things are in general popular for the wrong reasons, that's not useful.

joepie91 · 2017-09-11T19:04:19Z

Incidentally, do you have any specific and actionable suggestions on how the methodology be changed to account for this?
Measuring how many contributions, maintainers, and users there are on a feature-basis a grand idea. I haven't got the foggiest how you'd collect such data across the entire open source ecosystem.

To be clear - I'm not saying that measuring on a per-feature basis is the end-all-be-all of accurate research in this area, and there may very well be other issues with it that I haven't considered; but it would address this particular issue with the current methodology.

It is indeed a much, much more difficult metric to obtain; but there exists no law that research must be simple to carry out. The influence of corporate funding on open-source is simply a really difficult thing to quantify (or even research!), and a lot of work will be needed to paint a comprehensive picture of it. Certainly more work than a few scripts scraping GitHub for contributor counts.

That it's really difficult to obtain accurate results, doesn't in any way justify the publication of inaccurate (but easy-to-obtain) results. It simply means that the gathered data did not lead to a useful conflusion, and that it should be either discarded or, ideally, published with a very clear description of the issues it has, such that it can still be used in other research.

That's fundamentally my issue here; conclusions are drawn from this data that the data doesn't support, and they are presented as conclusive and accurate, when they really aren't in the slightest. That leads to articles like that of TechRepublic, which in turn leads to misguided ideas among the general public.

I might actually have to go build that (damn you), but it currently doesn't exist and would measure entirely different things.

I'd be very happy to see something like this exist :) However, it too is no small task to undertake, and there are currently easier wins to be had in other areas; for example, teaching developers how to recognize quality and support issues with dependencies early on by themselves.

The robustness of an ecosystem, and dare I say for lack of better term, velocity, surrounding a project is a useful feature of sorts to base your decision on. Not sole feature, but useful and interesting one. For better and worse.

The problem here is that you're not taking into account that this is a relative metric. For a large, monolithic thing that does many things and needs constant maintenance, you need a stable support network behind it, and you need to have a pool of competent developers to work with it. Training a new developer to work with it is expensive.

This is not true for small, modular dependencies that do one thing; they're "just a bit of other code in the language I know" that any developer can work with after looking at it for ten minutes, because there's no large proprietary ecosystem built around it with decades of habits and oddities that could never be ironed out for backwards compatibility reasons, or components that didn't quite work together because the project maintainers hadn't anticipated your usecase.

Yes, the support base for small modular dependencies is often a lot smaller; but at the same time, the support requirements are also smaller, and typically by a far larger margin than the support base.

When you look at it on a scale of "does the support base meet my support requirements", the answer is going to be "yes" far more often for a modular dependency than for a monolithic one, even when the modular dependencies are one-man projects and the monolithic dependencies are corporate-backed. For many modular dependencies, the support requirements are zero.

(I can't speak for Vue in particular; I don't use it and have no experience with it, and I don't know how modular it really is. This is about modular dependencies in general.)

People are free to draw their own conclusions from the underlying data, the charts produced by this project are obviously interesting and relevant (even if it doesn't sadly cover your specific curiosities) and are based on the best metrics available at hand.

Again: my problem is with the conclusions that are being drawn in the article. If this were a raw data dump, or if the conclusions being drawn were accurate and supported by the data, there'd be no problem. But as it stands, the conclusions in the article are wrong; and "the best metrics available at hand" simply do not meet the minimum bar required to support those conclusions, and therefore should not be used for it. Sometimes the answer is not to publish at all, not to stubbornly push through to have something.

I'd be very happy to see a better chart from you instead of complaining that popular things are in general popular for the wrong reasons, that's not useful.

A better chart of what? I'm pointing out that the conclusions in the article do not match the data, and the remark about developers picking tools for the wrong reasons was simply an example of an unsupported leap of logic being made here.

There's absolutely no obligation to present alternative data or conclusions, when reviewing and criticizing somebody elses research. The criticism stands on its own. I'd be happy to have a discussion about what useful conclusions can be drawn from the data collected here, but that is not what this issue is about, and it's a separate discussion to have.

lukaszgryglicki · 2017-09-12T04:49:03Z

Just a comment.
This is not a GitHub only data.
Data also comes from CloudFoundry, Jira, OpenStack, SVN, LKMA ...
But yes, the main source of data is GitHub.

lukaszgryglicki closed this as completed Dec 13, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Severely flawed methodology, wrong results #3

Severely flawed methodology, wrong results #3

joepie91 commented Sep 10, 2017 •

edited

Loading

dankohn commented Sep 11, 2017

joepie91 commented Sep 11, 2017 •

edited

Loading

namliz commented Sep 11, 2017 •

edited

Loading

joepie91 commented Sep 11, 2017

lukaszgryglicki commented Sep 12, 2017 •

edited

Loading

Severely flawed methodology, wrong results #3

Severely flawed methodology, wrong results #3

Comments

joepie91 commented Sep 10, 2017 • edited Loading

dankohn commented Sep 11, 2017

joepie91 commented Sep 11, 2017 • edited Loading

namliz commented Sep 11, 2017 • edited Loading

joepie91 commented Sep 11, 2017

lukaszgryglicki commented Sep 12, 2017 • edited Loading

joepie91 commented Sep 10, 2017 •

edited

Loading

joepie91 commented Sep 11, 2017 •

edited

Loading

namliz commented Sep 11, 2017 •

edited

Loading

lukaszgryglicki commented Sep 12, 2017 •

edited

Loading