# GitHub Archive (gharchive.org)

## What is the GitHub Archive?

https://www.gharchive.org

> Open-source developers all over the world are working on millions of projects: writing code & documentation, fixing & submitting bugs, and so forth. GH Archive is a project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis.
>
> GitHub provides 15+ event types, which range from new commits and fork events, to opening new tickets, commenting, and adding members to a project. These events are aggregated into hourly archives, which you can access with any HTTP client.

---

* Activity archives are available starting 2/12/2011.
* Activity archives for dates between 2/12/2011 and 12/31/2014 were recorded from the (now deprecated) Timeline API.
* **Activity archives for dates starting 1/1/2015 are recorded from the Events API.**

---

* Each archive is a compressed (.gz) JSON Lines formatted file.
* Each line of the file is a JSON object representing a single event. Its `type` field identifies which event type it is.
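
As a minimal sketch of reading such an archive: the helper names `archive_url` and `iter_events` are hypothetical, and the URL pattern (with an hour that is not zero-padded) is an assumption based on the download examples on gharchive.org. The demonstration parses a tiny in-memory archive of made-up events, so it runs without network access.

```python
import gzip
import io
import json
from datetime import datetime


def archive_url(hour: datetime) -> str:
    """Build the URL of one hourly archive (assumed pattern, hour not zero-padded)."""
    return f"https://data.gharchive.org/{hour:%Y-%m-%d}-{hour.hour}.json.gz"


def iter_events(fileobj):
    """Yield one dict per non-empty line of a gzip-compressed JSON Lines stream."""
    with gzip.open(fileobj, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            if line.strip():
                yield json.loads(line)


# Tiny in-memory archive with two made-up events, compressed like a real file.
sample = b'{"id": "1", "type": "WatchEvent"}\n{"id": "2", "type": "PushEvent"}\n'
events = list(iter_events(io.BytesIO(gzip.compress(sample))))
print([event["type"] for event in events])
```

For a file that has already been downloaded, the same generator can be fed `open(path, "rb")`. Because it yields one event at a time, a full hourly archive never has to be held in memory at once.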

## Design: GitHub Events

### How GitHub Event types are modeled

> The Events API can return different types of events triggered by activity on GitHub. Each event response contains shared properties, but has a unique payload object determined by its event type. The Event object common properties describes the properties shared by all events, and each event type describes the payload properties that are unique to the specific event.

_More info: https://docs.github.com/en/webhooks-and-events/events/github-event-types_

We use Python 3's [support for enumerations](https://docs.python.org/3/library/enum.html) to capture the **kinds of things** that exist.

* Code completion in editors reduces iterations caused by typos in config, and reduces typing (and with it RSI).
* The functions provided by the `enum` module are powerful, and the HOWTO is detailed: https://docs.python.org/3/howto/enum.html
* We use the enumeration members as keys in the configuration.


```python
from enum import Enum


class EventType(Enum):
    '''https://docs.github.com/en/webhooks-and-events/events/github-event-types'''

    def __str__(self):
        # Render a member as its plain event name, e.g. 'WatchEvent'.
        return str(self.value)

    CommitCommentEvent            = 'CommitCommentEvent'
    CreateEvent                   = 'CreateEvent'
    DeleteEvent                   = 'DeleteEvent'
    ForkEvent                     = 'ForkEvent'
    GollumEvent                   = 'GollumEvent'
    IssueCommentEvent             = 'IssueCommentEvent'
    IssuesEvent                   = 'IssuesEvent'
    MemberEvent                   = 'MemberEvent'
    PublicEvent                   = 'PublicEvent'
    PullRequestEvent              = 'PullRequestEvent'
    PullRequestReviewEvent        = 'PullRequestReviewEvent'
    PullRequestReviewCommentEvent = 'PullRequestReviewCommentEvent'
    PullRequestReviewThreadEvent  = 'PullRequestReviewThreadEvent'
    PushEvent                     = 'PushEvent'
    ReleaseEvent                  = 'ReleaseEvent'
    SponsorshipEvent              = 'SponsorshipEvent'
    WatchEvent                    = 'WatchEvent'
```
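
A short sketch of how such an enumeration might be used when reading events: the `type` string from an event is looked up by value, and an unrecognised string is reported as `None` rather than raising. The two-member `EventType` here is a trimmed stand-in for the full class, and `parse_event_type` is a hypothetical helper, not part of the library.

```python
from enum import Enum


class EventType(Enum):
    """Two-member subset of the full enumeration, for illustration only."""
    PushEvent = 'PushEvent'
    WatchEvent = 'WatchEvent'

    def __str__(self):
        return str(self.value)


def parse_event_type(raw: str):
    """Return the matching EventType, or None for an unrecognised string."""
    try:
        return EventType(raw)  # Enum lookup by value
    except ValueError:
        return None


known = parse_event_type('WatchEvent')
unknown = parse_event_type('NotARealEvent')
names = [str(member) for member in EventType]  # iteration follows definition order
```

Looking events up by value means a typo or a newly introduced event type surfaces in one place instead of propagating as a bare string.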

### Entities

#### Common properties on each event


```json
[
  {
    "type": "WatchEvent",
    "public": false,
    "payload": {
    },
    "repository": {
      "id": 3,
      "name": "octocat/Hello-World",
      "url": "https://api.github.com/repos/octocat/Hello-World"
    },
    "actor": {
      "id": 1,
      "login": "octocat",
      "gravatar_id": "",
      "avatar_url": "https://github.com/images/error/octocat_happy.gif",
      "url": "https://api.github.com/users/octocat"
    },
    "org": {
      "id": 1,
      "login": "github",
      "gravatar_id": "",
      "url": "https://api.github.com/orgs/github",
      "avatar_url": "https://github.com/images/error/octocat_happy.gif"
    },
    "created_at": "2011-09-06T17:26:27Z",
    "id": "12345"
  }
]
```



### Event Example: WatchEvent (GitHub "Star")

Recall that the event header looks like the example below. By convention, `id` is used as the primary key. The helper library renames `id` columns to `{entity}_github_id`.

```json
[
  {
    "type": "WatchEvent",
    "public": false,
    "payload": {
    },
    "repository": {
      "id": 3,
      "name": "octocat/Hello-World",
      "url": "https://api.github.com/repos/octocat/Hello-World"
    },
    "actor": {
      "id": 1,
      "login": "octocat",
      "gravatar_id": "",
      "avatar_url": "https://github.com/images/error/octocat_happy.gif",
      "url": "https://api.github.com/users/octocat"
    },
    "org": {
      "id": 1,
      "login": "github",
      "gravatar_id": "",
      "url": "https://api.github.com/orgs/github",
      "avatar_url": "https://github.com/images/error/octocat_happy.gif"
    },
    "created_at": "2011-09-06T17:26:27Z",
    "id": "12345"
  }
]
```
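
The `id` renaming mentioned before the example can be sketched as follows. The helper library's actual implementation is not shown in this document, so `rename_id` is a hypothetical stand-in that only illustrates the naming rule.

```python
def rename_id(entity: str, record: dict) -> dict:
    """Return a copy of record with 'id' renamed to '{entity}_github_id'."""
    renamed = dict(record)  # shallow copy, the input is left untouched
    if 'id' in renamed:
        renamed[f'{entity}_github_id'] = renamed.pop('id')
    return renamed


# The actor object from the example event, reduced to two fields.
actor = {'id': 1, 'login': 'octocat'}
actor_row = rename_id('actor', actor)
print(actor_row)
```

Prefixing the key with the entity name keeps the `id` values of different entities (actor, org, repo) distinguishable once they sit side by side in the same table or join.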

The `name` and `url` fields are listed in `category_cols`, which indicates that they should be handled as a [Pandas Category](https://pandas.pydata.org/docs/user_guide/categorical.html).

>The categorical data type is useful in the following cases:
>
> * A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory, see [here](https://pandas.pydata.org/docs/user_guide/categorical.html#categorical-memory).
>
> * The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order.
>
> * As a signal to other Python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).

This is similar to the dictionary encoding used in the Arrow project. Each distinct string is stored once, and every row holds only a small integer code that points to it, instead of a full copy of the string per event. For columns with many repeated values, such as repository names and URLs, this reduces memory usage enough to make analysis practical on a single laptop such as a MacBook Pro.

More info: https://pandas.pydata.org/docs/user_guide/categorical.html#categorical-memory
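
A small sketch of the memory effect, using made-up data (30,000 rows drawn from three repository names). The exact byte counts depend on the pandas version and string storage, so none are quoted here.

```python
import pandas as pd

# Made-up data: 30,000 rows, but only three distinct repository names.
names = pd.Series(
    ['octocat/Hello-World', 'octocat/Spoon-Knife', 'github/docs'] * 10_000
)
as_category = names.astype('category')

# deep=True counts the string payloads, not just the per-row references.
bytes_plain = names.memory_usage(deep=True)
bytes_category = as_category.memory_usage(deep=True)

print(f'plain strings: {bytes_plain:,} bytes')
print(f'category:      {bytes_category:,} bytes')
print(f'distinct values stored: {len(as_category.cat.categories)}')
```

The saving grows with the ratio of rows to distinct values. A column where nearly every value is unique gains little from the conversion and may even use more memory.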

Configuration for the `repo` entity:

```python
str(EntityType.Repo).lower(): {
    'description': 'ENTITY: repo',
    'dtypes_write': {
        'repo_github_id': sa_dialect_postgresql.BIGINT,
        'name':           sa_dialect_postgresql.TEXT,
        'org_name':       sa_dialect_postgresql.TEXT,
        'repo_name':      sa_dialect_postgresql.TEXT,
        'url':            sa_dialect_postgresql.TEXT
        },
    'category_cols': ['name', 'url'],
    'embedded_json_cols': [],
    'natural_keys': ['id'],
    'db_table_name': 'repo'
    },
```
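
One way such a configuration entry could be applied is sketched below. This is an assumption about how `category_cols` might be consumed, not the library's actual code: `repo_config` is a trimmed-down copy of the entry above and `apply_category_cols` is a hypothetical helper.

```python
import pandas as pd

# Hypothetical, trimmed-down entity configuration (only the key used here).
repo_config = {'category_cols': ['name', 'url']}


def apply_category_cols(df: pd.DataFrame, config: dict) -> pd.DataFrame:
    """Return a copy of df with the configured columns cast to category dtype."""
    present = [col for col in config['category_cols'] if col in df.columns]
    return df.astype({col: 'category' for col in present})


repos = pd.DataFrame({
    'repo_github_id': [3, 3],
    'name': ['octocat/Hello-World'] * 2,
    'url': ['https://api.github.com/repos/octocat/Hello-World'] * 2,
})
typed = apply_category_cols(repos, repo_config)
print(typed.dtypes)
```

Driving the cast from configuration keeps the per-entity decisions in one place, so adding a column to `category_cols` needs no change to the loading code.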

### Resources (arrow/feather/pandas)

* https://arrow.apache.org/docs/python/pandas.html
* https://arrow.apache.org/docs/python/feather.html#feather-file-format
* https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_feather.html#pandas-dataframe-to-feather
* https://arrow.apache.org/docs/python/generated/pyarrow.feather.write_feather.html#pyarrow.feather.write_feather

* https://arrow.apache.org/docs/python/data.html#dictionary-arrays

* https://observablehq.com/@uwdata/introducing-arquero
* https://observablehq.com/@uwdata/arquero-and-apache-arrow
* https://observablehq.com/@uwdata/an-illustrated-guide-to-arquero-verbs
