Last year I did a small exploration of GitHub to show the various communities using it and how they work. I wanted to do it again this year, but I lacked the time and motivation to start over. A couple of months ago, I got a message from mojombo asking me if I was planning to do a new poster. That gave me the motivation to work on it again.
And of course, the poster. Feel free to print it yourself; it is A1 size.
All the data are available! Last year I got some emails asking me for the dataset. So this time I asked first if I could release the data with the code and the poster, and the answer is yes! So if you're interested, you can download it.
The data are stored in MongoDB, so I provide a dump that you can easily restore:
% wget http://maps.stargit.net/dump/github.
% tar xvzf github.tgz
% cd github
% mongorestore -d github .
Now you can use MongoDB to browse the imported database. There are five collections: profiles, repositories, relations, contributions, and edges.
Last year I did a simple "follower/following" graph. It was already interesting, but it was also far too simple. This time I wanted to go deeper in the exploration.
The various steps to process all this data are:
- using the GitHub API, fetch information from the profiles.
- once all the profiles are collected, fetch information about the repositories. Only forked repositories are kept.
- keep the "simple" relations (followers/following), which are used later to add weight to relations.
- tag each user with the main programming language they use. Using the GitHub API, I was able to categorize ~40k profiles (about 1/3 of my whole dataset).
- using the GeoNames API, extract the name of the country the user is in. This time, about 55k profiles were tagged.
- fetch the contributions for each repository.
- compute a score between the author of the contribution and the owner of the repo.
- add a weight to each edge, using the computed score plus 1 if the developer follows the other developer.
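The last two steps can be sketched in a few lines of Python. This is only an illustration, not the code used for the maps: the post does not say how the score is computed, so `contribution_score` (one point per contribution) is a made-up placeholder, and the `Edge` type, the input shapes, and the choice to skip self-contributions are assumptions too.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    source: str    # login of the contributor
    target: str    # login of the repository owner
    weight: float  # score, plus 1 if source follows target


def contribution_score(count: int) -> float:
    """Hypothetical score: one point per contribution (assumption)."""
    return float(count)


def build_edges(contributions, follows):
    """Build weighted edges.

    contributions: dict mapping (author, owner) to a number of contributions.
    follows: set of (follower, followed) pairs.
    """
    edges = []
    for (author, owner), count in sorted(contributions.items()):
        if author == owner:
            continue  # assumption: ignore contributions to one's own repo
        weight = contribution_score(count)
        if (author, owner) in follows:
            weight += 1.0  # the "+1" bonus when the author follows the owner
        edges.append(Edge(author, owner, weight))
    return edges


if __name__ == "__main__":
    contributions = {("alice", "bob"): 3, ("carol", "bob"): 1}
    follows = {("alice", "bob")}
    for edge in build_edges(contributions, follows):
        print(edge)
```

Keeping the follow bonus separate from the score makes it easy to swap in another scoring function without touching the rest.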
For all the graphs, I've used one color per language family. Some families group several languages:
- C (C++, C#)
- JVM (Java, Clojure, Scala)
- Lisp (Emacs Lisp, Common Lisp)
Feel free to do your own analysis in the comments :) For each map, you'll find a PDF of the map, and the graph to explore using Gephi (in GEXF or GDF format).
But first, some numbers
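If you want to build your own graph for Gephi from the dump, GDF is the easier of the two formats to write by hand: a `nodedef>` header followed by one node per line, then an `edgedef>` header followed by one edge per line. Here is a minimal sketch in Python. The `language` column and the input tuples are my own assumptions for the example, not the layout of the published files, and it assumes the values contain no commas.

```python
def to_gdf(nodes, edges):
    """Serialize a small graph to GDF text.

    nodes: iterable of (name, language) pairs.
    edges: iterable of (source, target, weight) triples.
    """
    lines = ["nodedef>name VARCHAR,language VARCHAR"]
    for name, language in nodes:
        lines.append(f"{name},{language}")
    lines.append("edgedef>node1 VARCHAR,node2 VARCHAR,weight DOUBLE")
    for source, target, weight in edges:
        lines.append(f"{source},{target},{weight}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    nodes = [("alice", "Ruby"), ("bob", "Perl")]
    edges = [("alice", "bob", 4.0)]
    print(to_gdf(nodes, edges), end="")
```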
- 123 562 profiles
- 2 730 organizations
- 40 807 repositories
It took me about a month to collect the data and build the tools I needed.
The following chart shows the number of accounts created each month. "Everyone" means the total number of accounts created. You can also see the numbers for each community.
On the "Everyone" graph, you can see a huge peak around April 2008. That's the date GitHub was launched.
For most of the communities, the number of created accounts has been decreasing since 2010. I think the reason is that most of the developers from those communities are now on GitHub.
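For the curious, the counting behind such a chart could look like the sketch below. It is an assumption-laden example rather than my actual tooling: the `created_at` and `language` field names and the timestamp format are guesses, so check them against the documents in the `profiles` collection before reusing it.

```python
from collections import Counter
from datetime import datetime

# Assumed timestamp format; adjust it to match the dump.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def accounts_per_month(profiles):
    """Count account creations per month.

    Returns two counters: one keyed by "YYYY-MM" for everyone, and one
    keyed by (language, "YYYY-MM") for the profiles that carry a tag.
    """
    everyone = Counter()
    by_language = Counter()
    for profile in profiles:
        created = datetime.strptime(profile["created_at"], TIMESTAMP_FORMAT)
        month = created.strftime("%Y-%m")
        everyone[month] += 1
        language = profile.get("language")
        if language:  # untagged profiles only count toward "Everyone"
            by_language[(language, month)] += 1
    return everyone, by_language


if __name__ == "__main__":
    sample = [
        {"created_at": "2008-04-10T12:00:00Z", "language": "Ruby"},
        {"created_at": "2008-04-22T08:30:00Z", "language": "Perl"},
        {"created_at": "2010-01-05T00:00:00Z"},
    ]
    everyone, by_language = accounts_per_month(sample)
    for month in sorted(everyone):
        print(month, everyone[month])
```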
(Keep in mind that these numbers come from the profiles I was able to tag, roughly 40k.)
- Ruby: 10046 (28%)
- Python: 5403 (15%)
- C: 5093 (14%) (C, C++, C#)
- PHP: 3933 (11%)
- JVM: 3790 (10%) (Java, Clojure, Scala, Groovy)
- Perl: 1215 (3%)
- Lisp: 348 (<1%) (Emacs Lisp, Common Lisp)
Those numbers don't really match what GitHub reports (https://github.com/languages), but that could be explained by the way I selected my users.
- United States: 19861 (36%)
- United Kingdom: 3533 (6%)
- Germany: 3009 (5%)
- Canada: 2657 (4%)
- Brazil: 2454 (4%)
- France: 1833 (3%)
- Japan: 1799 (3%)
- Russia: 1604 (2%)
- Australia: 1441 (2%)
- China: 1159 (2%)
The United States is still the most represented country on GitHub; no surprise here.
If you are interested in the "geography" of Open Source, you should read these two articles: Coding Places and Investigating the Geography of Open Source Software through GitHub.
Looking at the "company" field on users' profiles, here are some stats about which companies have employees using GitHub:
- ThoughtWorks: 102
- Google: 66
- Mozilla: 65
- Yahoo!: 65
- Red Hat: 64
- Globo.com: 55
- Twitter: 53
- Facebook: 45
- Yandex: 43
- Intridea: 34
- Microsoft: 33
- Engine Yard: 32
- Pivotal Labs: 29
- MIT: 28
- Rackspace: 27
- IBM: 24
- Caelum: 23
- Novell: 22
- GitHub: 22
- VMware: 22
I didn't know the first company, ThoughtWorks, and I was expecting to see Facebook or Twitter as the company with the most developers on GitHub. It's also interesting to see Yandex here.
Global graph (1628 nodes, 9826 edges)
The main difference from last year is the Android / modders community. They develop mostly in C and Java. The poster was created from this map.
Ruby (1968 nodes, 9662 edges)
Python (1062 nodes, 2631 edges)
Here we have some clusters. I'm not familiar with the Python community, so I can't really give any insight.
Perl (608 nodes, 2967 edges)
I really like this graph since it shows (in my opinion) one of the real strengths of this community: everybody works with everybody. People working on a web framework will collaborate with people working on Moose, or an ORM, or other tools. It shows that in this community, people are competent in more than one field.
The Perl community is about the same size as last year. However, we can extract the following information:
- the Japanese Perl hackers are still a cluster by themselves
- miyagawa is still the glue between the Japanese community and the "rest of the world"
- other leaders are: Florian Ragwitz (rafl), Andy Armstrong (AndyA), Dave Rolsky (autarch)
- some clusters exist for Moose and Dancer.
As we can see in the earlier chart, the number of accounts created by Perl developers is stalling.
United States (2646 nodes, 11344 edges)
This one is really nice. We can clearly see all the communities. Two things stand out:
- C and Ruby are on opposite sides (C on the left, Ruby on the right)
- Python and Perl are also opposed (Perl at the bottom and Python at the top)
I'll let you draw your own conclusions from this one ;)
France (706 nodes, 1059 edges)
We have a lot of small clusters on this one, and some very big authorities.
Japan (464 nodes, 1091 edges)
There are three dominant clusters on this one:
The Ruby and Perl ones are well connected. A lot of Japanese hackers on CPAN use both languages.
I would like to thank the whole GitHub team for their interest in the previous poster and for asking for another one this year :)
A huge thanks to Alexis for his help on building the awesome StarGit. Another big thanks to Antonin for his work on the poster.