There's a map for that

Not long ago, we began rendering 3D models on GitHub. Today we're excited to announce the latest addition to the visualization family - geographic data. Any .geojson file in a GitHub repository will now be automatically rendered as an interactive, browsable map, annotated with your geodata.

screen shot 2013-06-13 at 10 23 32 am

People are already using GitHub to store everything from Chicago zipcodes to community radio stations to historic hurricane paths, and we can't wait to see what the community will begin to collaborate on next.

Under the hood we use Leaflet.js to render the GeoJSON data, and overlay it on a custom version of MapBox's street view baselayer — simplified so that your data can really shine. Best of all, the base map uses OpenStreetMap data, so if you find an area to improve, edit away.

Maps on GitHub support rendering GIS data as points, lines, and polygons. You can even customize the way your data is displayed, such as coloring and sizing individual markers, specifying a more descriptive icon, or providing additional human-readable information to help identify the feature on click.
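For example, here's a hedged sketch of generating such a file with Ruby (the property names follow the simplestyle convention for marker styling, and points.geojson is just an illustrative filename):

require "json"

# A sketch, not an official template: one point feature carrying
# simplestyle-style properties (marker-color, marker-size, marker-symbol,
# title) to control how the marker is drawn and labeled.
feature_collection = {
  "type" => "FeatureCollection",
  "features" => [
    {
      "type" => "Feature",
      "geometry" => { "type" => "Point", "coordinates" => [13.4, 52.5] },
      "properties" => {
        "title"         => "Example marker",
        "marker-color"  => "#fc4353",
        "marker-size"   => "large",
        "marker-symbol" => "park"
      }
    }
  ]
}

# Write the file; commit it to any repository to get a rendered map.
File.write("points.geojson", JSON.pretty_generate(feature_collection))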

Looking to get started? Simply commit a .geojson file to a new or existing repository, or dive into the docs to learn how to customize the map's styling.

Git Merge Berlin 2013

Last month GitHub was proud to host the first Git Merge conference, a place for Git core developers and Git users to meet, talk about Git, and share what they've been working on or are interested in. The first Git Merge was held in Berlin at the amazing Radisson Blu Berlin on May 9-11, 2013.

git-merge

The Git Merge conference grew out of the GitTogether meetings that Git developers held for several years at Google's campus, directly after the Google Summer of Code Mentors Summit. We felt we should hold a similar gathering of Git minds in the EU to accomplish the same things - get Git developers together to meet in person, talk about the interesting things they're working on, and meet some users.

dev-dinner

This conference was run a little differently than most. It was split up into three days - a Developer Day, a User Day and a Hack Day.

The first day was the Developer Day, limited to individuals who have contributed to core Git or to one of its implementations, such as libgit2 or JGit. About 30 developers came and had discussions ranging from an incremental merge tool, to our participation and success in the Google Summer of Code program, to fixing race conditions in the Git server code.

dev-talking

The second day was the User Day, meant to let everyone share tools they were working on or issues they have with Git. The first half of the day was set up entirely in lightning-talk format, and over 40 talks were given, ranging in duration from a few seconds to nearly 20 minutes. After the lightning talks were done and everyone who wanted to had spoken, we broke up into small group discussions on more specific topics - Laws on GitHub, Git migration issues, tools and tips for teaching Git, and more.

The final day was the Hack Day which gave attendees a chance to sit down with people they had met the previous day or two and start working on something interesting.

Notes for the entire conference, collaborated on by attendees, can be found here.

Recorded talks from each day can be found here. Some really interesting examples are Roberto Tyley's talk on bfg-repo-cleaner, a tool for cleaning up bad history in Git repositories, and this talk covering the German federal law repository on GitHub.

Thanks to everyone who attended!

Introducing Octokit

We're happy to announce Octokit, our new lineup of GitHub-maintained client libraries for the GitHub API.

octokits

Today, we're making our first two libraries available.

octokit/octokit.rb

Octokit.rb (formerly pengwynn/octokit) has been developed by the Ruby community over several years and provides access to the GitHub API in idiomatic Ruby.
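A minimal sketch of what that looks like in practice (unauthenticated requests are fine for public data, so no credentials are needed here):

require "octokit"

# Fetch a public repository and print a couple of its attributes.
client = Octokit::Client.new
repo = client.repository("octokit/octokit.rb")
puts "#{repo.full_name}: #{repo.description}"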

octokit/octokit.objc

OctoKit has been extracted from GitHub for Mac by the Mac team and is a Cocoa and Cocoa Touch framework for interacting with the GitHub API, built using AFNetworking, Mantle, and ReactiveCocoa.

Introducing GitHub Sudo Mode

In the ongoing effort to keep our users safe, we recently took inspiration from the Unix sudo command. We wanted to require password confirmation for dangerous actions on GitHub.com, but we didn't want to force you to constantly re-enter your password.

Meet GitHub's "sudo mode"

sudo mode screenshot

Dangerous actions (e.g. adding email addresses or public keys) will now require password confirmation. If you're deep in the zone and you're doing a lot of these dangerous actions, we'll only ask you to re-authenticate every few hours.

With this balance of security and convenience, we help you keep your account safe, without getting in your way. Feedback is always welcome. Enjoy!

Installing Git from GitHub for Mac

In today's release of GitHub for Mac, you can now easily install Git for use on the command line, without needing to download any separate packages. And whenever we update the version of Git included with GitHub for Mac, you'll get the changes automatically – no work required on your part!

After updating the app, you may notice some changes to the Preferences window. On the newly renamed "Advanced" tab, simply click "Install Command Line Tools":

Advanced preferences pane

You'll be prompted for an administrator password so that Git can be installed into /usr/local/bin, and then you should very shortly see that it succeeded:

Installation Complete

If you're using GitHub for Mac for the first time, and want to install Git, you can also set it up from the welcome screen:

Configure Git welcome screen

Once installed, you can open up Terminal.app and run git commands to your heart's content. Command line aficionados, rejoice!

Update: We've since removed the ability to install Git from GitHub for Mac, because OS X Mavericks and later includes a version of Git already.

Repository redirects are here!

It's a fact of life - sometimes repository names change. This can happen in a few different scenarios:

  • When you rename a repository.
  • When you rename your user or organization account.
  • When you transfer a repository from one user or organization to another.

We're happy to announce that starting today, we'll automatically redirect all requests for previous repository locations to their new home in these circumstances. There's nothing special you have to do. Just rename away and we'll take care of the rest.

As a special bonus, we'll also be servicing all Git clone, fetch, and push requests from previous repository locations.

There is one caveat with the new redirect support worth noting: GitHub Pages sites are not currently redirected when their repositories are renamed. Renaming a Pages repository will continue to break any existing links to content hosted on the github.io domain or on custom domains.

Personal API tokens

You can now create your own personal API tokens for use in scripts and on the command line. Be careful: these tokens are like passwords, so guard them carefully. The advantage of using a token instead of putting your password into a script is that a token can be revoked, and you can generate as many as you like. Head on over to your settings to manage personal API tokens.

screenshot

Wait! There are already some tokens in there!

Don't panic. You've always been able to create arbitrary OAuth access tokens via the API. In fact, if you use tools like hub or boxen, they already use the authorizations endpoint to generate tokens instead of storing your password.
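As an illustration, here's a minimal sketch of using a token from a Ruby script; storing it in a GITHUB_TOKEN environment variable is just one reasonable convention:

require "net/http"
require "json"
require "uri"

# Send the token in the Authorization header instead of a password.
uri = URI("https://api.github.com/user")
request = Net::HTTP::Get.new(uri)
request["Authorization"] = "token #{ENV['GITHUB_TOKEN']}"

response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
  http.request(request)
end

puts JSON.parse(response.body)["login"]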

Closing Issues via Pull Requests

It's been possible to close an issue from a commit for quite a while, but some issues take more work than a single commit to close. That's why you can now close an issue from a Pull Request. All you have to do is include the special keyword syntax (e.g. "fixes #5") in the body of your Pull Request.

And the referenced issue will automatically be closed when the PR is merged into the default branch.

You will even see the references as pending fixes before merging.

This works the same way closing an issue from a commit message does. It even works across repositories.

Happy bug fixing!

Jekyll Turns 1.0

GitHub Pages — the easiest way to quickly publish beautiful pages for you and your projects — just got a major upgrade. We're now running Jekyll 1.0.2, which contains over 100 changes and new features. Some of the ones we're most excited to start using:

  • Support for the Gist tag for easily embedding Gists (example)
  • Automatically generated post excerpts (example)
  • Save and preview drafts before publishing (example)
  • Lots of features that make creating and testing sites locally easier

You can read the full changelog to see exactly what's new, and if you run Jekyll locally, we'd also recommend checking out the information on upgrading.

New to Jekyll? This release also marks the launch of a brand new documentation site designed to help new users dive right in.

Jekyll's come a long way since it started nearly five years ago, and this milestone marks the open source project's first major release. Congratulations to all of the project's contributors. :tada:

GitHub Enterprise 11.10.310 Release

We're excited to announce the latest release of GitHub Enterprise. Along with a variety of general improvements and adjustments, this new release brings the following features from GitHub.com:

In addition, we're including several new Enterprise-specific features:

64-bit Appliance Image

We've been working on 64-bit support for some time, and some customers have had early access to these images for quite a while now. We're happy to announce that all new OVA image downloads, starting with this release, will be 64-bit. GHPs for 32-bit systems will still be available for the foreseeable future, giving people running older appliances the opportunity to migrate at their leisure. You can get more information about how to migrate from a 32-bit appliance to a 64-bit appliance here.

New Management Console Interface

The Management Console interface has remained largely unchanged since we launched GitHub Enterprise nearly a year and a half ago. It worked fairly well, but definitely looked dated and had some problems rendering in Firefox and Internet Explorer. This design refresh was geared largely toward making it work more consistently across browsers, so users who had difficulties using it in browsers other than Chrome should have a better experience now!


GitHub OAuth Authentication

We've added a new authentication method: you can now hook your Enterprise installation up to GitHub.com via OAuth. You do this by setting up a new OAuth application that belongs to your organization on GitHub.com and then using its client ID and secret. After hooking that up, users who are members of your GitHub.com organization will be able to log in automatically via the standard OAuth approval process. All of their public user information on GitHub.com will be pulled in, along with their email addresses and SSH public keys.


Improved Upgrade Process

Perhaps the most common upgrade problem we've encountered involved a timeout being reached during the initial GHP unpacking step, something that started happening as the GHP grew in size. To solve this issue, we've moved the GHP unpacking stage into a background job, so the request will no longer time out; this should improve the upgrade experience dramatically going forward. However, due to how the upgrade process works, you won't see the benefit of this until your next upgrade after 11.10.310. We've also made some improvements that will help prevent successful upgrades from being detected as failures.

Better Reporting

In previous releases, it wasn't possible to get full reports about all repositories, users, or organizations in an installation via the Admin Tools dashboard. Now you can easily download CSV reports with all of this information via the new Reports section.


Suspending Dormant Users in Bulk

The dormant user check has been updated to work more like a GitHub Enterprise admin would expect, by removing some checks that made a lot of sense for GitHub.com but not so much in a dedicated installation. It's common to want to see which users are dormant so you know whom to suspend to free up seats, so in addition to getting a report on who's dormant, you can now browse dormant users and suspend all of them in bulk.

Improved Search Tooling

We've added a new Indexing section to the Admin Tools dashboard that allows for additional management of search functionality. You can now disable code search or code search indexing, and initiate code search backfill or issue search index repair jobs. You can also see the status of the ElasticSearch cluster on your appliance.



We hope you enjoy these features as much as we do. Don't forget that there is more information available about GitHub Enterprise at https://enterprise.github.com/. The latest release can always be downloaded from here.

File CRUD and repository statistics now available in the API

Today we're happy to announce two big additions to our API: File CRUD and Repository Statistics.

File CRUD

The repository contents API has allowed reading files for a while. Now you can easily commit changes to single files, just like you can in the web UI.
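As a hedged sketch of what that enables (the repository name and paths here are hypothetical, and the octokit gem is just one way to call the API):

require "octokit"

client = Octokit::Client.new(access_token: ENV["GITHUB_TOKEN"])
repo = "someuser/somerepo"  # hypothetical repository

# Create a new file; the API commits it directly to the given branch.
client.create_contents(repo, "docs/hello.md", "Add hello doc",
                       "Hello, world!", branch: "master")

# Updating requires the blob SHA of the existing file.
existing = client.contents(repo, path: "docs/hello.md", ref: "master")
client.update_contents(repo, "docs/hello.md", "Update hello doc",
                       existing.sha, "Hello again!", branch: "master")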

Starting today, these methods are available to you:

Repository Statistics

We're using the repository statistics API to power our graphs, but we can't wait to see what others do with this information.
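For instance, a script could pull contributor statistics for a repository; here's a hedged sketch against the stats endpoints, with a hypothetical repository name:

require "net/http"
require "json"
require "uri"

uri = URI("https://api.github.com/repos/someuser/somerepo/stats/contributors")
request = Net::HTTP::Get.new(uri)
request["Authorization"] = "token #{ENV['GITHUB_TOKEN']}"

response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
  http.request(request)
end

# The stats endpoints return 202 while the data is still being computed.
if response.code == "202"
  puts "Statistics are still being generated; try again shortly"
else
  JSON.parse(response.body).each do |entry|
    puts "#{entry['author']['login']}: #{entry['total']} commits"
  end
end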

Starting today, these resources are available to you:

Enjoy!

Repository Search on all Repositories

Today we are allowing you to search your own public repositories and any private repositories you have access to.

In an effort to simplify our search, we've consolidated our search boxes into one. There's no need to look around anymore for the search box: it's always at the top.

When you're on a repository page, you'll see an indication that you're searching that repository by default:

image

To search globally, all you need to do is select the All repositories option:

image

You may have already noticed that the command bar will also give you these options:

image

Finally, if you didn't find what you were looking for in your repository, you can turn a repository search into a global search by clicking Search all of GitHub.

image

For any search-related questions, take a look at our search guides.

Good luck, gumshoes!

Hey Judy, don't make it bad

Last week we explained how we greatly reduced the rendering time of our web views by switching our escaping routines from Ruby to C. This speed-up was two-fold: the C code for escaping HTML was significantly faster than its Ruby equivalent, and on top of that, the C code was generating a lot fewer objects on the Ruby heap, which meant that subsequent garbage collection runs would run faster.

When working with a mark-and-sweep garbage collector (like the one in MRI), the number of objects in the heap at any given moment matters a lot. The more objects, the longer each GC pause will take (all the objects must be traversed during the mark phase!), and since MRI's garbage collector is also "stop the world", Ruby code cannot execute while GC is running, and hence web requests cannot be served.

In Ruby 1.9 and 2.0, the ObjectSpace module exposes useful metadata about the current state of the garbage collector and the Ruby heap. Probably the most useful method provided by this module is count_objects, which returns the number of objects allocated in the Ruby heap, broken down by type: this offers a very insightful bird's-eye view of the current state of the heap.

We tried running count_objects on a fresh instance of our main Rails application, as soon as all the libraries and dependencies were loaded:

GitHub.preload_all
GC.start
count = ObjectSpace.count_objects

puts count[:TOTAL] - count[:FREE]
#=> 605183

Whelp! More than 600k Ruby objects allocated just after boot! That's a lotta heap, like we say in my country. The obvious question now is whether all those objects on the heap are actually necessary, and whether we can free some of them, or simply avoid allocating them, to reduce our garbage collection times.

This question, however, is rather hard to answer by using only the ObjectSpace module. Although it offers an ObjectSpace#each_object method to enumerate all the objects that have been allocated, this enumeration is of very little use because we cannot tell where each object was allocated and why.

Fortunately, @tmm1 once again had a master plan. With a few lines of code, he added __sourcefile__ and __sourceline__ methods to every single object via the Kernel, keeping track of the file and line at which each object was allocated. This is priceless: we are now able to iterate through every single object in the Ruby heap and pinpoint and aggregate its source of allocation.

GitHub.preload_all
GC.start
ObjectSpace.each_object.to_a.inject(Hash.new 0){ |h,o| h["#{o.__sourcefile__}:#{o.class}"] += 1; h }.
  sort_by{ |k,v| -v }.
  first(10).
  each{ |k,v| printf "% 6d  |  %s\n", v, k }
36244  |  lib/ruby/1.9.1/psych/visitors/to_ruby.rb:String
28560  |  gems/activesupport-2.3.14.github21/lib/active_support/dependencies.rb:String
26038  |  gems/actionpack-2.3.14.github21/lib/action_controller/routing/route_set.rb:String
19337  |  gems/activesupport-2.3.14.github21/lib/active_support/multibyte/unicode_database.rb:ActiveSupport::Multibyte::Codepoint
17279  |  gems/mime-types-1.19/lib/mime/types.rb:String
10762  |  gems/tzinfo-0.3.36/lib/tzinfo/data_timezone_info.rb:TZInfo::TimezoneTransitionInfo
10419  |  gems/actionpack-2.3.14.github21/lib/action_controller/routing/route.rb:String
 9486  |  gems/activesupport-2.3.14.github21/lib/active_support/dependencies.rb:RubyVM::InstructionSequence
 8459  |  gems/actionpack-2.3.14.github21/lib/action_controller/routing/route_set.rb:RubyVM::InstructionSequence
 5569  |  gems/actionpack-2.3.14.github21/lib/action_controller/routing/builder.rb:String

Oh boy, let's take a look at this in more detail. Clearly, there are allocation sources which we can do nothing about (the Rails core libraries, for example), but the biggest offender here looks very interesting. Psych is the YAML parser that ships with Ruby 1.9+, so apparently something is parsing a lot of YAML and keeping it in memory at all times. Who could this be?

A Pocket-size Library of Babel

Linguist is an open-source Ruby gem which we developed to power our language statistics for GitHub.com.

People push a lot of code to GitHub, and we needed a reliable way to identify and classify all the text files which we display on our web interface. Are they actually source code? What language are they written in? Do they need to be highlighted? Are they auto-generated?

The first versions of Linguist took a pretty straightforward approach to solving these problems: definitions for all the languages we know of were stored in a YAML file, with metadata such as each language's file extensions, the type of language, the lexer to use for syntax highlighting, and so on.

However, this approach fails in many important corner cases. What's in a file extension? That which we call .h by any other extension would take just as long to compile. It could be C, or it could be C++, or it could be Objective-C. We needed a more reliable way to separate these cases, and the hundreds of other ambiguous situations in which a file extension maps to more than one programming language, or a source file has no extension at all.

That's why we decided to augment Linguist with a very simple classifier: armed with a pocket-size Library of Babel of code samples (that is, a collection of source code files in different languages hosted on GitHub), we attempt a weighted classification of every new source code file we encounter.

The idea is simple: when faced with a source code file we cannot recognize, we tokenize it, and then use a weighted classifier to find the likelihood that the tokens in the file belong to a given programming language. For example, an #include token is very likely to belong to a C or C++ file, and not to a Ruby file. A class token could well belong to either a C++ file or a Ruby file, but if we find both an #include and a class token in the same file, then the answer is most definitely C++.
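Here's a hedged Ruby sketch of that idea (not Linguist's actual implementation, and the token probabilities are made up):

# Score each language by summing the log-probabilities of the tokens
# found in the unknown file, then pick the best-scoring language.
def classify(tokens, token_probabilities)
  token_probabilities.map { |language, probs|
    score = tokens.inject(0.0) { |sum, token| sum + Math.log(probs.fetch(token, 1e-6)) }
    [language, score]
  }.max_by { |_, score| score }.first
end

probabilities = {
  "C++"  => { "#include" => 0.04, "class" => 0.03 },
  "Ruby" => { "class" => 0.05, "def" => 0.06 }
}

classify(["#include", "class"], probabilities)
#=> "C++"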

Of course, to perform this classification we need to keep in memory a large list of tokens for every programming language hosted on GitHub, together with their respective probabilities. It was this collection of tokens that was topping our allocation meters for the Ruby garbage collector. For the classifier to be accurate, it needs to be trained with a large dataset (the bigger the better), and although 36,000 token samples are barely enough to train a classifier, they are a lot for the poor Ruby heap.

Take a slow classifier and make it better

We had a very obvious plan to fix this issue: move the massive token dataset out of the Ruby Heap and into native C-land, where it doesn't need to be garbage collected, and keep it as compact as possible in memory.

For this, we decided to store the tokens in a Judy Array, a trie-like data structure that acts as an associative array or key-value store with some very interesting performance characteristics.

As opposed to traditional trie-like data structures for storing strings, branching happens one byte at a time (i.e. the Judy Array acts as a 256-ary trie), and the nodes are highly compressed: the claim is that thanks to this compression, Judy Arrays can be packed extremely tightly into cache lines, minimizing the number of cache misses per lookup. The supposed result is lookup times that can compete with a hash table, even though the algorithmic complexity of Judy Arrays is O(log n), like any other trie-like structure.

Of course, there is no real-world silver bullet when it comes to algorithmic performance, and Judy Arrays are no exception. Despite the claims in Judy's original whitepaper, cache misses in modern CPU architectures do not fetch data stored in the Prison of Azkaban; they fetch it from the L2 cache, which happens to be oh-not-that-far-away.

In practice, this means that the constant time lost to a few (certainly not many) cache misses in hash table lookups (O(1)) is not enough to offset the lookup time of a Judy array (O(log n)), no matter how tightly packed it is. On top of that, in hash tables with linear probing and a small step size, the point about reduced cache misses becomes moot, as most collisions can be resolved in the same cache line where they happened. These practical results have been proven over and over again in real-world tests. At the end of the day, a properly tuned hash table will always be faster than a Judy Array.

Why did we choose Judy arrays for the implementation, then? For starters, our goal right now is not performance (classification is usually not a performance-critical operation), but maximizing the size of the training dataset while minimizing its memory usage. Judy Arrays, thanks to their remarkable compression techniques, store the keys of our dataset in a much smaller chunk of memory and with much less redundancy than a hash table.

Furthermore, we are pitting Judy Arrays against MRI's Hash Table implementation, which is known to be not particularly performant. With some thought on the way the dataset is stored in memory, it becomes feasible to beat Ruby's hash tables at their own game, even if we are performing logarithmic lookups.

The main design constraint for this problem is that the tokens in the dataset need to be separated by language. The YAML file we load in memory takes the straightforward approach of creating one hash table per language, containing all of its tokens. We can do better with a trie structure, however: we can store all the tokens in the same Judy Array, but prefix each one with a unique 2-byte identifier for its language. This creates independent subtrees of tokens inside the same global data structure for each language, which increases cache locality and reduces the logarithmic cost of lookups.
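A tiny Ruby sketch of that key layout (the language ids here are made up for illustration):

# Every token is stored under a key that begins with a fixed 2-byte
# language id, so all of a language's tokens land in the same subtree.
LANGUAGE_IDS = { "C" => 0, "C++" => 1, "Ruby" => 2 }

def token_key(language, token)
  [LANGUAGE_IDS.fetch(language)].pack("n") + token  # "n" = 16-bit big-endian
end

token_key("C++", "#include")
#=> "\x00\x01#include"
token_key("Ruby", "class")
#=> "\x00\x02class"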

Judy Array Structure

For the dataset's typical query pattern (burst lookups of thousands of tokens of the same language in a row), having these subtrees means keeping the cache permanently warm and minimizing the amount of traversal around the Array, since the internal Judy cursor never leaves a language's subtree between queries.

The results of this optimization are much more positive than what we'd expect from benchmarking a logarithmic time structure against one which allegedly performs lookups in constant time:

Lookup times (no gc)

In this benchmark, where we have disabled MRI's garbage collector, we can see how the lookup of 3.5 million tokens in the database stays more than 50% faster than the Hash Table, even as we artificially grow the dataset with random tokens. Thanks to the locality of the per-language token subtrees, lookup times remain mostly constant and don't exhibit logarithmic behavior.

Things get even better for Judy Arrays when we enable the garbage collector and GC cycles start being triggered between lookups:

Lookup times (gc)

Here we can see how the massive size of the data structures in the Ruby heap causes the garbage collector to go bananas, with huge spikes in lookup times as the dataset increases and GC runs are triggered. The Judy Array (stored outside the Ruby heap) remains completely unfazed, and what's more, it maintains its constant lookup time while Hash Table lookups become more and more expensive because of the higher garbage collection times.

The cherry on top comes from graphing the RSS usage of our Ruby process as we increase the size of our dataset:

RSS usage

Once again (and this time as anticipated), Judy Arrays throw MRI's Hash Table implementation under a bus. Their growth remains very much linear and extremely slow, while the hash tables show considerable bumps and very fast growth as they get resized.

GC for the jilted generation

With the new storage engine for tokens on Linguist's classifier, we are now able to dramatically expand our sampling dataset. A bigger dataset means more accurate classification of programming languages and more accurate language graphs on all repositories; this makes GitHub more awesome.

The elephant in the room still lives on in the shape of MRI's garbage collector, however. Without a generational GC capable of finding and marking roots in the heap that are very unlikely to ever be freed, we must pay constant attention to the number of objects we allocate in our main app. More objects not only mean higher memory usage: they also mean longer garbage collection pauses and slower requests.

The good news is that Koichi Sasada has recently proposed a Generational Garbage Collector for inclusion in MRI 2.1. This prototype is remarkable because it allows a subset of generational garbage collection to happen while maintaining compatibility with MRI's current C extension API, which in its current iteration has several trade-offs (for the sake of simplicity when writing extensions) that make memory management for internal objects extremely difficult.

This compatibility with older versions, of course, comes at a price. Objects in the heap now need to be separated into "shady" and "sunny" objects, depending on whether they have write barriers, and hence whether they can be generationally collected. This forces an overly complicated implementation of the GC interfaces (several Ruby C APIs must drop the write barrier from objects when they are used), and the additional bookkeeping needed to separate the different kinds of objects creates performance regressions under lighter GC loads. On top of that, this new garbage collector is also forced to run expensive Mark & Sweep phases for the young generation (as opposed to e.g. a copying phase) because of the design choices that make the current C API support only conservative garbage collection.

Despite the best efforts of Koichi and other contributors, Ruby Core's concern with backwards compatibility (particularly regarding the C Extension API) keeps MRI lagging more than a decade behind Ruby implementations like Rubinius and JRuby which already have precise, generational and incremental garbage collectors.

It is unclear at the moment whether this new GC in its current state will make it into the next version of MRI, and whether it will be a case of "too little, too late" given the many handicaps of the current implementation. The only thing we can do for now is wait and see... Or more like wait and C. HAH. Amirite guys? Amirite?

Check the status of your branches

Beginning today, you can head over to your favorite repository's Branches page and see the build status for the HEAD of each branch.
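The statuses shown on the page come from the commit status API; as a hedged sketch, a CI system might report one like this (the repository, commit SHA, and URL are hypothetical):

require "octokit"

client = Octokit::Client.new(access_token: ENV["GITHUB_TOKEN"])

# Report a build result for the commit at the HEAD of a branch.
client.create_status("someuser/somerepo", "abc1234def5678", "success",
                     target_url: "https://ci.example.com/builds/42",
                     description: "The build passed")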

Better yet, the page updates automatically whenever a new build finishes. Enjoy!

live-branch-statuses

Heads up: nosniff header support coming to Chrome and Firefox

Both GitHub and Gist offer ways to view "raw" versions of user content. Instead of viewing files in the visual context of the website, the user can see the actual text content as it was committed by the author. This can be useful if you want to select-all-and-copy a file or just see a Markdown file without having it rendered. The key point is that this is a feature to improve the experience of our human users.

Some pesky non-human users (namely computers) have taken to "hotlinking" assets via the raw view feature -- using the raw URL as the src for a <script> or <img> tag. The problem is that these are not static assets. The raw file view, like any other view in a Rails app, must be rendered before being returned to the user. This quickly adds up to a big toll on performance. In the past we've been forced to block popular content served this way because it put excessive strain on our servers.

We added the X-Content-Type-Options: nosniff header to our raw URL responses way back in 2011 as a first step in combating hotlinking. This has the effect of forcing the browser to treat content in accordance with the Content-Type header. That means that when we set Content-Type: text/plain for raw views of files, the browser will refuse to treat that file as JavaScript or CSS.
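You can see both headers for yourself; here's a hedged Ruby sketch that inspects a raw file view (the URL is hypothetical, and any raw file will do):

require "net/http"
require "uri"

uri = URI("https://raw.github.com/someuser/somerepo/master/README.md")
response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
  http.head(uri.path)
end

puts response["Content-Type"]            # text/plain for raw file views
puts response["X-Content-Type-Options"]  # nosniff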

Until recently, Internet Explorer was the only browser to respect this header, so this method of hotlinking prevention has not been effective for many users. We're happy to report that the good people at Google and Mozilla are moving towards adoption as well. As nosniff support is added to Chrome and Firefox, hotlinking will stop working in those browsers, and we wanted our beloved users, human and otherwise, to know why.