In this notebook we experiment with analyzing GitHub event data from 2015 using [dask](http://dask.pydata.org/en/latest/) ([github](https://github.com/continuumIO/dask)).


**Dask** is a tool for out-of-core, parallel data analysis. We will use [dask.bag](http://dask.pydata.org/en/latest/bag.html), which provides an API for operations on unordered collections that may contain duplicates (like sets, but with repeated elements allowed). It is useful for semi-structured data like JSON blobs or log files. More blog posts about dask can be found [here](http://www.continuum.io/blog/tags/dask) or [here](http://matthewrocklin.com/blog/tags.html#dask-ref).

### Github Archive Data on Google Cloud Storage

We took data from [githubarchive.com](githubarchive.com) for January 2015 and put it in Google Cloud Storage, so that transfers to Google Compute, which runs Binder, are free.

Let's inspect the data first so we can learn the data schema and find something to analyze.

In [1]:
!rm github-data.tar
!wget  https://storage.googleapis.com/blaze-data/github-data/github-data.tar

rm: cannot remove 'github-data.tar': No such file or directory
converted 'https://storage.googleapis.com/blaze-data/github-data/github-data.tar' (ANSI_X3.4-1968) -> 'https://storage.googleapis.com/blaze-data/github-data/github-data.tar' (UTF-8)
--2015-09-24 21:58:41--  https://storage.googleapis.com/blaze-data/github-data/github-data.tar
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.69.128, 2607:f8b0:4001:c08::80
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.69.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2822236160 (2.6G) [application/x-tar]
Saving to: 'github-data.tar'


2015-09-24 21:59:19 (71.3 MB/s) - 'github-data.tar' saved [2822236160/2822236160]



In [2]:
!tar xf github-data.tar  # takes about a minute

In [3]:
!du -h data/

2.7G	data/


### Inspect Data with `dask.bag`

The extracted archive holds approximately 2.7 GB of data according to the `du` output above, stored as one file per hour, averaging around 7.8 MB each (compressed). We make a dask bag from the data and inspect it to figure out the schema.

In [4]:
from dask.diagnostics import ProgressBar
import dask.bag as db
import ujson as json

# Load the hourly files for 2015-01-01 and parse each line as a JSON object.
# Note that gz decompression happens automatically at compute time.
# (Newer dask releases call this function db.read_text.)
b = db.from_filenames('data/2015-01-01-*').map(json.loads)
b.npartitions  # number of files

24
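
To make the automatic decompression step concrete, here is a minimal sketch of what loading one such file amounts to, assuming each file is gzip-compressed with one JSON event per line. It uses only the Python 3 standard library rather than dask, and the file name, the helper `read_json_lines`, and the two events are hypothetical stand-ins, not archive data.

```python
import gzip
import json
import os
import tempfile


def read_json_lines(path):
    """Yield one parsed JSON object per line of a gzip-compressed text file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)


# Write a tiny stand-in file (hypothetical events, not real archive data).
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample-0.json.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as f:
    f.write(json.dumps({"type": "PushEvent", "id": "1"}) + "\n")
    f.write(json.dumps({"type": "WatchEvent", "id": "2"}) + "\n")

# Read it back lazily, one event at a time, then materialize the list.
events = list(read_json_lines(sample_path))
```

A bag does roughly this once per file, with each file forming one partition, which is why `npartitions` above equals the number of files.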

In [5]:
first = b.take(1)[0]  # take the first json object from the file
first

{u'actor': {u'avatar_url': u'https://avatars.githubusercontent.com/u/9152315?',
  u'gravatar_id': u'',
  u'id': 9152315,
  u'login': u'davidjhulse',
  u'url': u'https://api.github.com/users/davidjhulse'},
 u'created_at': u'2015-01-01T00:00:00Z',
 u'id': u'2489368070',
 u'payload': {u'before': u'86ffa724b4d70fce46e760f8cc080f5ec3d7d85f',
  u'commits': [{u'author': {u'email': u'david.hulse@live.com',
     u'name': u'davidjhulse'},
    u'distinct': True,
    u'message': u'Altered BingBot.jar\n\nFixed issue with multiple account support',
    u'sha': u'a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81',
    u'url': u'https://api.github.com/repos/davidjhulse/davesbingrewardsbot/commits/a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81'}],
  u'distinct_size': 1,
  u'head': u'a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81',
  u'push_id': 536740396,
  u'ref': u'refs/heads/master',
  u'size': 1},
 u'public': True,
 u'repo': {u'id': 28635890,
  u'name': u'davidjhulse/davesbingrewardsbot',
  u'url': u'https://api.github

In [6]:
first.keys()  # top level keys in this json object

[u'payload', u'created_at', u'actor', u'public', u'repo', u'type', u'id']

The `type` key looks interesting. What are the possible types, and how often does each occur? We can inspect this with the bag's `frequencies` method. This takes longer than `take` because it requires reading every file in the bag.

In [7]:
with ProgressBar():
    res = b.pluck('type').frequencies().compute()
res

[########################################] | 100% Completed | 14.4s


[(u'ReleaseEvent', 816),
 (u'PublicEvent', 177),
 (u'PullRequestReviewCommentEvent', 2173),
 (u'ForkEvent', 7144),
 (u'MemberEvent', 474),
 (u'PullRequestEvent', 8735),
 (u'IssueCommentEvent', 17045),
 (u'PushEvent', 119242),
 (u'DeleteEvent', 3843),
 (u'CommitCommentEvent', 1399),
 (u'WatchEvent', 21939),
 (u'IssuesEvent', 9843),
 (u'CreateEvent', 23913),
 (u'GollumEvent', 2196)]
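
As a rough mental model of the computation above, the following sketch counts event types per partition and then merges the partial counts. It is plain Python, not dask's actual implementation, and the two small `partitions` are hypothetical stand-ins for hourly files.

```python
from collections import Counter

# Hypothetical partitions: each inner list stands in for one hourly file.
partitions = [
    [{"type": "PushEvent"}, {"type": "WatchEvent"}, {"type": "PushEvent"}],
    [{"type": "ForkEvent"}, {"type": "PushEvent"}],
]


def count_types(partition):
    """Count how often each event type occurs within one partition."""
    return Counter(event["type"] for event in partition)


# Count each partition independently (this part could run in parallel),
# then merge the partial counts into one overall tally.
partial_counts = [count_types(p) for p in partitions]
total = sum(partial_counts, Counter())
type_counts = sorted(total.items())
```

Because each partition is counted on its own, only the small per-partition tallies need to be combined at the end, never the raw events.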

Top Committers
----------------

So most events are pushes, which is not surprising. Let's ask "Who pushes the most?".

We do this by filtering the events down to `PushEvent`s, counting how often each username appears among those pushes, and then taking the top 5.
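
The same three steps can be sketched on a small in-memory list with the standard library. This is only an illustration of the logic, not of how dask executes it, and the logins `alice`, `bob`, and `carol` are made up.

```python
import heapq
from collections import Counter

# Hypothetical events with only the fields this example needs.
events = [
    {"type": "PushEvent", "actor": {"login": "alice"}},
    {"type": "WatchEvent", "actor": {"login": "bob"}},
    {"type": "PushEvent", "actor": {"login": "bob"}},
    {"type": "PushEvent", "actor": {"login": "alice"}},
    {"type": "PushEvent", "actor": {"login": "carol"}},
    {"type": "PushEvent", "actor": {"login": "bob"}},
    {"type": "PushEvent", "actor": {"login": "alice"}},
]

# Step 1: keep only the push events.
push_events = [e for e in events if e["type"] == "PushEvent"]

# Step 2: count how often each login appears among the pushes.
login_counts = Counter(e["actor"]["login"] for e in push_events)

# Step 3: take the top 2 (login, count) pairs, ranked by count.
top_2 = heapq.nlargest(2, login_counts.items(), key=lambda kv: kv[1])
```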

In [8]:
pushes = b.filter(lambda x: x['type'] == 'PushEvent')  # keep only the push events
names = pushes.pluck('actor').pluck('login') # get the login names
top_5 = names.frequencies().topk(5, key=lambda kv: kv[1])  # top 5 pushers; kv is a (name, count) pair
with ProgressBar():
    res = top_5.compute()  # run the above computations
res

[########################################] | 100% Completed | 13.7s


[(u'kinlane', 3843),
 (u'KenanSulayman', 1912),
 (u'mirror-updates', 1008),
 (u'qdm', 702),
 (u'greatfire', 576)]

These users *pushed* the most, but a push can contain multiple commits. So we can ask "Who pushed the most *commits*?".

We can figure this out by grouping by username, then summing the number of commits from every push for each user. More technically, we want to `GroupBy` on usernames, so that for each username we get a list of their `PushEvent`s. We then reduce each `PushEvent` to a `count` of its commits, and reduce these `count`s by `sum`ming them for each user. So we are grouping, then reducing.
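
A minimal sketch of this group-then-reduce approach in plain Python, using hypothetical pushes whose commits are empty placeholder dicts:

```python
from collections import defaultdict

# Hypothetical push events; each commit is an empty placeholder dict.
pushes = [
    {"actor": {"login": "alice"}, "payload": {"commits": [{}, {}]}},
    {"actor": {"login": "bob"}, "payload": {"commits": [{}]}},
    {"actor": {"login": "alice"}, "payload": {"commits": [{}, {}, {}]}},
]

# Group: collect every push under its author's login.
groups = defaultdict(list)
for push in pushes:
    groups[push["actor"]["login"]].append(push)

# Reduce: count the commits in each push, then sum the counts per user.
commit_totals = {
    login: sum(len(p["payload"]["commits"]) for p in user_pushes)
    for login, user_pushes in groups.items()
}
```

Note that the grouping step holds every push for a user in memory at once before any counting happens.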

However, there are algorithms for grouping and reducing simultaneously, which avoid expensive shuffle operations and are much faster. In dask bag this is `foldby`. Analogous methods are [`toolz.reduceby`](https://toolz.readthedocs.io/en/latest/api.html#toolz.itertoolz.reduceby) and, in pyspark, [`RDD.combineByKey`](https://spark.apache.org/docs/latest/api/python/pyspark.html?#pyspark.RDD.combineByKey).
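
To illustrate the idea, here is a simplified sketch of a fold-by-key over partitions: each partition keeps one running total per key, and the per-partition totals are then merged with `combine`. The helpers `fold_partition` and `foldby_sketch` are hypothetical names for this illustration and are not dask's implementation.

```python
def fold_partition(key, binop, partition, initial):
    """Fold one partition into a dict of running totals, one per key."""
    totals = {}
    for item in partition:
        k = key(item)
        totals[k] = binop(totals.get(k, initial), item)
    return totals


def foldby_sketch(key, binop, partitions, initial, combine):
    """Fold each partition separately, then merge the per-key totals."""
    merged = {}
    for partition in partitions:
        partials = fold_partition(key, binop, partition, initial)
        for k, partial in partials.items():
            merged[k] = combine(merged[k], partial) if k in merged else partial
    return merged


# Hypothetical pushes split across two partitions.
partitions = [
    [{"login": "alice", "commits": [{}, {}]}, {"login": "bob", "commits": [{}]}],
    [{"login": "alice", "commits": [{}, {}, {}]}],
]

result = foldby_sketch(
    key=lambda push: push["login"],
    binop=lambda total, push: total + len(push["commits"]),
    partitions=partitions,
    initial=0,
    combine=lambda a, b: a + b,
)
```

Only one number per user is kept at any time, so no list of pushes per user is ever built and nothing has to be moved between partitions except the small totals.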

In [9]:
def get_logins(x):
    """The key for foldby, like a groupby key. Get the username from a PushEvent"""
    return x['actor']['login']

def binop(total, x):
    """Count the number of commits in a PushEvent"""
    return total + len(x['payload']['commits'])

def combine(total1, total2):
    """This combines commit counts from PushEvents"""
    return total1 + total2

commits = pushes.foldby(get_logins, binop, initial=0, combine=combine)
top_commits = commits.topk(5, key=lambda kv: kv[1])  # kv is a (name, count) pair
with ProgressBar():
    res = top_commits.compute()
res

[########################################] | 100% Completed | 12.5s


[(u'mirror-updates', 9912),
 (u'kinlane', 3843),
 (u'KenanSulayman', 1912),
 (u'qdm', 703),
 (u'greatfire', 576)]

We can verify this by visiting some of these GitHub profiles:
[kinlane](https://github.com/kinlane), [KenanSulayman](https://github.com/KenanSulayman), [qdm](https://github.com/qdm). The "mirror-updates" user doesn't actually show any activity on these days.