# The Rise of GitHub

GitHub has become the dominant channel that development teams use to collaborate on code. Wikipedia's [Timeline of GitHub](https://en.wikipedia.org/wiki/Timeline_of_GitHub) documents GitHub's rise to dominance as a *business*. We will use a `mirror` crawl to analyze GitHub's rise as a *platform*.

## Dataset

Any analysis must start with data. The dataset we use here is the result of a crawl of public GitHub repositories conducted using [`mirror`](https://github.com/simiotics/mirror).

You can build the same dataset by running `mirror github crawl` to collect the raw repository information and then `mirror github sync` to create a SQLite database of the type used in this notebook.

If you do create your own dataset, change the variable below to point at the SQLite database you generate.

In [1]:
import os

In [2]:
GITHUB_SQLITE = os.path.expanduser('~/data/mirror/github.sqlite')
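
With default settings, `sqlite3.connect` creates a new, empty database when the path does not exist, so a mistyped path would only surface later as a missing `repositories` table. The following is a minimal sketch of a guard against that. `connect_existing` is a hypothetical helper introduced here for illustration. It is not part of `mirror` or of the `sqlite3` module.

```python
import os
import sqlite3


def connect_existing(path):
    """Open a SQLite database only if the file is already on disk.

    Hypothetical helper: it refuses to connect to a missing path so that
    sqlite3 does not silently create an empty database there.
    """
    if not os.path.isfile(path):
        raise FileNotFoundError('No SQLite database found at {}'.format(path))
    return sqlite3.connect(path)
```

Calling `connect_existing(GITHUB_SQLITE)` in place of `sqlite3.connect(GITHUB_SQLITE)` would then fail immediately on a wrong path.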

Let us explore the structure of the dataset before we dive into our analysis.

Basic repository metadata (extracted by crawling the GitHub [`/repositories`](https://developer.github.com/v3/repos/#list-all-public-repositories) endpoint) is stored in the `repositories` table of this database. This is its schema:

In [3]:
import sqlite3

In [4]:
conn = sqlite3.connect(GITHUB_SQLITE)
c = conn.cursor()

In [5]:
# Single quotes mark a SQL string literal; double quotes are meant for identifiers.
r = c.execute("select sql from sqlite_master where name='repositories';")

In [6]:
repositories_schema = r.fetchone()[0]

In [7]:
print(repositories_schema)

CREATE TABLE repositories (
        github_id UNSIGNED BIG INT,
        full_name TEXT NOT NULL,
        owner TEXT NOT NULL,
        html_url TEXT NOT NULL,
        api_url TEXT NOT NULL,
        is_fork BOOLEAN
    )


These columns do not provide comprehensive repository information, but they already let us answer some interesting questions, such as how many repositories there are and what proportion of them are forks.

Since `mirror` does not automatically create indices in the database, let us create some of our own to speed up these preliminary analyses:

In [None]:
# "if not exists" lets this cell be re-run without raising an error.
c.execute('create index if not exists repositories_owner on repositories(owner);')
c.execute('create index if not exists repositories_is_fork on repositories(is_fork);')
conn.commit()
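
To check that an index like the ones above is actually used, SQLite can be asked for its query plan. The sketch below runs against a throwaway in-memory database that copies the `repositories` schema, so it does not touch the crawl. The `demo` connection and the `simiotics` owner value are assumptions made for illustration only.

```python
import sqlite3

# Throwaway in-memory database with the same schema as the crawl.
demo = sqlite3.connect(':memory:')
demo.execute('''
    CREATE TABLE repositories (
        github_id UNSIGNED BIG INT,
        full_name TEXT NOT NULL,
        owner TEXT NOT NULL,
        html_url TEXT NOT NULL,
        api_url TEXT NOT NULL,
        is_fork BOOLEAN
    )
''')
demo.execute('create index repositories_owner on repositories(owner);')

# List the indices that SQLite has recorded for the table.
index_names = [
    row[0]
    for row in demo.execute(
        "select name from sqlite_master "
        "where type='index' and tbl_name='repositories';"
    )
]
print(index_names)

# Ask the planner how it would run a lookup by owner.
# The last column of each plan row is a human-readable description.
plan = demo.execute(
    'explain query plan select count(*) from repositories where owner = ?;',
    ('simiotics',),
).fetchall()
plan_details = [row[-1] for row in plan]
print(plan_details)
```

If the index is picked up, its name appears in the plan description. The exact wording of that description varies between SQLite versions.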

### Number of repositories on GitHub

In [None]:
r = c.execute('select count(*) from repositories;')
result = r.fetchone()
print(result[0])

### Proportion of repositories that are forks

In [None]:
r = c.execute('select sum(is_fork)*1.0/count(*) from repositories;')
result = r.fetchone()
print(result[0])
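
The schema allows `is_fork` to be NULL, and the query above counts such rows in its denominator while `sum` skips them in the numerator. Whether the crawl ever stores NULL there is not something this notebook establishes. The sketch below uses a toy in-memory table with made-up rows to show how that query would differ from `avg(is_fork)`, which ignores NULL rows entirely.

```python
import sqlite3

# Toy table with made-up rows; only the columns needed for the comparison.
demo = sqlite3.connect(':memory:')
demo.execute('create table repositories (full_name TEXT NOT NULL, is_fork BOOLEAN);')
demo.executemany(
    'insert into repositories (full_name, is_fork) values (?, ?);',
    [
        ('a/one', 0),
        ('a/two', 1),
        ('b/three', 0),
        ('b/four', None),  # fork status unknown
    ],
)

# First column: forks divided by all rows, as in the query above.
# Second column: forks divided by rows whose fork status is known.
over_all_rows, over_known_rows = demo.execute(
    'select sum(is_fork)*1.0/count(*), avg(is_fork) from repositories;'
).fetchone()
print(over_all_rows, over_known_rows)
```

With one fork among four rows, one of which has unknown status, the first value is 1/4 and the second is 1/3. On a table with no NULL values in `is_fork`, the two expressions agree.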