# OHW Github analysis

### A meta-hackweek hack

This is a rough and ready approach to measuring the impact of OHW (Oceanhackweek) on participants' Github activity.

It goes without saying that **commits =/= work done on a project**. I freely admit that many of my commits are nonsense! This is just a fun side project for me to explore the github API and try some simple analysis methods.

First off, to access the [Github API](https://docs.github.com/) you'll need to edit the credentials file `credentials.json` to supply your username and a [Github access token](https://github.blog/2013-05-16-personal-api-tokens/).

Once you have supplied these creds, you are still limited to [5000 requests per hour](https://docs.github.com/en/rest/overview/resources-in-the-rest-api#rate-limiting) so, if you get bounced by the API, leave it some time to cool off. Without credentials the limit is far lower and you would soon generate an error message like this:

![](rate-limit-anon.png)

Obviously I have not provided my own credentials! I'm not sure what action github would take if a DDOS on their API originated from my account, but I'm not willing to find out.
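
One way to tell how long to "cool off" is to look at the rate-limit headers GitHub attaches to each response. The sketch below assumes the `X-RateLimit-Remaining` and `X-RateLimit-Reset` headers described in the rate-limiting docs linked above. The helper name `seconds_until_reset` and the sample header values are made up for illustration.

```python
import time

def seconds_until_reset(headers, now=None):
    """Return how many seconds to wait before the rate limit resets.

    `headers` is a mapping of response headers, e.g. `response.headers`
    from a `requests` call. Returns 0 while requests remain.
    """
    now = time.time() if now is None else now
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    reset_at = int(headers.get("X-RateLimit-Reset", 0))  # epoch seconds
    if remaining > 0:
        return 0
    return max(0, reset_at - int(now))

# Hypothetical header values, for illustration only
sample_headers = {"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "1599330000"}
wait = seconds_until_reset(sample_headers, now=1599329400)
print(f"Rate limit hit, wait {wait} seconds")
```

With a real response you would pass `response.headers`, which `requests` exposes as a case-insensitive mapping, so the header capitalisation does not matter there.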

### Let's get scraping!

In [1]:
import json
import requests
from collections import Counter
import pandas as pd
import numpy as np

In [2]:
# don't forget to add your creds to this file!
with open('credentials.json') as f:
    credentials = json.load(f)

username = credentials['username']
token = credentials['token']

For a start, let's use the API to get some details on my account

In [3]:
user_data = requests.get('https://api.github.com/users/' + credentials['username'],auth = (username,token)).json()
user_data

{'login': 'callumrollo',
 'id': 28703282,
 'node_id': 'MDQ6VXNlcjI4NzAzMjgy',
 'avatar_url': 'https://avatars0.githubusercontent.com/u/28703282?v=4',
 'gravatar_id': '',
 'url': 'https://api.github.com/users/callumrollo',
 'html_url': 'https://github.com/callumrollo',
 'followers_url': 'https://api.github.com/users/callumrollo/followers',
 'following_url': 'https://api.github.com/users/callumrollo/following{/other_user}',
 'gists_url': 'https://api.github.com/users/callumrollo/gists{/gist_id}',
 'starred_url': 'https://api.github.com/users/callumrollo/starred{/owner}{/repo}',
 'subscriptions_url': 'https://api.github.com/users/callumrollo/subscriptions',
 'organizations_url': 'https://api.github.com/users/callumrollo/orgs',
 'repos_url': 'https://api.github.com/users/callumrollo/repos',
 'events_url': 'https://api.github.com/users/callumrollo/events{/privacy}',
 'received_events_url': 'https://api.github.com/users/callumrollo/received_events',
 'type': 'User',
 'site_admin': False,
 'n

Now, I'll try the account of one of my collaborators, `ocefpaf`. You can specify any user, though the API returns less information than it does for your own account.

In [4]:
data = requests.get('https://api.github.com/users/' + 'ocefpaf',auth = (username,token)).json()
data

{'login': 'ocefpaf',
 'id': 950575,
 'node_id': 'MDQ6VXNlcjk1MDU3NQ==',
 'avatar_url': 'https://avatars1.githubusercontent.com/u/950575?v=4',
 'gravatar_id': '',
 'url': 'https://api.github.com/users/ocefpaf',
 'html_url': 'https://github.com/ocefpaf',
 'followers_url': 'https://api.github.com/users/ocefpaf/followers',
 'following_url': 'https://api.github.com/users/ocefpaf/following{/other_user}',
 'gists_url': 'https://api.github.com/users/ocefpaf/gists{/gist_id}',
 'starred_url': 'https://api.github.com/users/ocefpaf/starred{/owner}{/repo}',
 'subscriptions_url': 'https://api.github.com/users/ocefpaf/subscriptions',
 'organizations_url': 'https://api.github.com/users/ocefpaf/orgs',
 'repos_url': 'https://api.github.com/users/ocefpaf/repos',
 'events_url': 'https://api.github.com/users/ocefpaf/events{/privacy}',
 'received_events_url': 'https://api.github.com/users/ocefpaf/received_events',
 'type': 'User',
 'site_admin': False,
 'name': 'Filipe',
 'company': None,
 'blog': 'http://o

We can see a user's core stats. How about their commits and other actions taken? Simply append `/events` to the request URL.

### Events

In [5]:
data = requests.get('https://api.github.com/users/' + 'callumrollo' +'/events',auth = (username,token)).json()
data[0]

{'id': '13422558148',
 'type': 'PushEvent',
 'actor': {'id': 28703282,
  'login': 'callumrollo',
  'display_login': 'callumrollo',
  'gravatar_id': '',
  'url': 'https://api.github.com/users/callumrollo',
  'avatar_url': 'https://avatars.githubusercontent.com/u/28703282?'},
 'repo': {'id': 270385780,
  'name': 'callumrollo/callumrollo.github.io',
  'url': 'https://api.github.com/repos/callumrollo/callumrollo.github.io'},
 'payload': {'push_id': 5641264312,
  'size': 1,
  'distinct_size': 1,
  'ref': 'refs/heads/master',
  'head': 'a9e3bb9213139f467fb5058847ff063e851b1ce6',
  'before': '79a6e0329968330e91bd9d202fac5257ce5c9ab2',
  'commits': [{'sha': 'a9e3bb9213139f467fb5058847ff063e851b1ce6',
    'author': {'email': 'c.rollo@outlook.com', 'name': 'Callum Rollo'},
    'message': 'Generate Pelican site',
    'distinct': True,
    'url': 'https://api.github.com/repos/callumrollo/callumrollo.github.io/commits/a9e3bb9213139f467fb5058847ff063e851b1ce6'}]},
 'public': True,
 'created_at': '20

Note that, by default, the Github API returns only the 30 most recent events per request.

In [6]:
len(data)

30

To get at more events, we use a short loop to access subsequent pages of results. I found out the hard way that the API restricts you to 10 pages.

In [7]:
tgt_user = 'callumrollo'
base_url = 'https://api.github.com/users/' + tgt_user +'/events'
url = base_url
url_list = [base_url]
data = []
page_no = 1
repos_data = []
total_fetched = 0
while (True):
    response = requests.get(url,auth = (username,token)).json()
    data = data + response
    events_fetched = len(response)
    total_fetched = events_fetched + total_fetched
    print(f"Page: {page_no} total events fetched: {total_fetched}")
    
    if total_fetched == 300:
        print(f"\nAPI maxed out! https://docs.github.com/v3/#pagination\n\
        returning only most recent 300 events by {tgt_user}")
        print(f"\nevents span the range \n{data[-1]['created_at']}\n{data[0]['created_at']}")
        break
    
    if (events_fetched == 30):
        page_no = page_no + 1
        url = base_url + '?page=' + str(page_no)
        url_list.append(url)
    else:
        print(f"\n{tgt_user}: all your events are belong to us now")
        print(f"\nevents span the range \n{data[-1]['created_at']}\n{data[0]['created_at']}")
        break

Page: 1 total events fetched: 30
Page: 2 total events fetched: 60
Page: 3 total events fetched: 90
Page: 4 total events fetched: 120
Page: 5 total events fetched: 150
Page: 6 total events fetched: 180
Page: 7 total events fetched: 210
Page: 8 total events fetched: 240
Page: 9 total events fetched: 270
Page: 10 total events fetched: 300

API maxed out! https://docs.github.com/v3/#pagination
        returning only most recent 300 events by callumrollo

events span the range 
2020-06-07T18:05:46Z
2020-09-05T17:21:13Z
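
The paging logic in the loop above can be pulled out into a small reusable function. This is only a sketch: `fetch_page` is a hypothetical callable standing in for the `requests.get(...).json()` call, so the example runs without touching the API, and the names `collect_events` and `fake_fetch` are my own.

```python
def collect_events(fetch_page, per_page=30, max_events=300):
    """Collect events page by page until a short page or the API cap is reached."""
    events = []
    page_no = 1
    while len(events) < max_events:
        batch = fetch_page(page_no)
        events.extend(batch)
        if len(batch) < per_page:
            # a short (or empty) page means there is nothing after it
            break
        page_no += 1
    return events[:max_events]


# Stand-in for the API: a user with 75 events, served 30 per page
fake_events = [{"id": str(n)} for n in range(75)]

def fake_fetch(page_no, per_page=30):
    start = (page_no - 1) * per_page
    return fake_events[start:start + per_page]

events = collect_events(fake_fetch)
print(len(events))
```

Against the real API, `fetch_page` could wrap the same request used in the loop, for example `lambda page_no: requests.get(base_url, params={'page': page_no}, auth=(username, token)).json()`.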


This system logs all events: commits, issues, PRs, forks, stars etc. We are only interested in commits.

Commits arrive inside events whose `type` entry is `PushEvent`. A single push can carry several commits.

In [8]:
for event in data:
    print(event["type"])


PushEvent
PushEvent
PushEvent
PushEvent
PushEvent
PushEvent
CreateEvent
PushEvent
PushEvent
PushEvent
PushEvent
PushEvent
PushEvent
PushEvent
CreateEvent
CreateEvent
WatchEvent
PushEvent
PushEvent
PushEvent
PushEvent
PushEvent
WatchEvent
CreateEvent
CreateEvent
PushEvent
PushEvent
PushEvent
PushEvent
PushEvent
IssuesEvent
WatchEvent
WatchEvent
PushEvent
PushEvent
PushEvent
PullRequestEvent
PushEvent
PullRequestEvent
PushEvent
PullRequestEvent
PushEvent
PushEvent
PullRequestEvent
PullRequestEvent
PushEvent
PullRequestEvent
CreateEvent
PushEvent
PullRequestEvent
CreateEvent
PushEvent
IssuesEvent
IssuesEvent
PushEvent
PushEvent
PullRequestEvent
PullRequestEvent
CreateEvent
PushEvent
CreateEvent
PushEvent
IssueCommentEvent
IssueCommentEvent
IssuesEvent
IssueCommentEvent
PushEvent
PullRequestEvent
IssueCommentEvent
CreateEvent
IssuesEvent
IssueCommentEvent
IssueCommentEvent
PushEvent
PushEvent
PullRequestEvent
PushEvent
IssueCommentEvent
PushEvent
IssueCommentEvent
IssueCommentEvent
PushEve
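
A long printout like this is easier to read as a tally. The sketch below uses `Counter` (already imported at the top of the notebook) on a few cut-down, made-up events. On the real data you would pass `data` instead of `sample_events`.

```python
from collections import Counter

# Hypothetical, cut-down events; real ones carry many more fields
sample_events = [
    {"type": "PushEvent"},
    {"type": "PushEvent"},
    {"type": "CreateEvent"},
    {"type": "WatchEvent"},
    {"type": "PushEvent"},
]

# Count how often each event type occurs
type_counts = Counter(event["type"] for event in sample_events)
print(type_counts.most_common())
```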

In [9]:
commit_events = []
for event in data:
    if event["type"] == "PushEvent":
        commit_events.append(event)
len(commit_events)

217

There are some complications. Not all of the commits in these events are by the Github user we are querying. For instance, some are commits by other users that our target user has merged in.

To work around this, we look through the payload of each `PushEvent` and retain only the commits associated with the user we are interested in.

**n.b.** this approach will only work if the name on the user's Github profile exactly matches the author name they use for their git commits.

In [10]:
tgt_username = requests.get('https://api.github.com/users/' + credentials['username'],
                            auth = (username,token)).json()["name"]
tgt_username

'Callum Rollo'

In [11]:
user_commits = []
for event in commit_events:
    commit_list = event["payload"]["commits"]
    commit_list_author = []
    if len(commit_list)>0:
        for com_n in range(len(commit_list)):
            commit_username = commit_list[com_n]["author"]["name"]
            #print(commit_username)
            if commit_username == tgt_username:
                commit_list_author.append(commit_list[com_n])
        if commit_list_author:
            event["payload"]["commits"] = commit_list_author
            user_commits.append(event)

In [12]:
len(user_commits)

214
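
Note that this count is the number of push events that contain at least one matching commit, not the number of commits. The loop above also overwrites `event["payload"]["commits"]` in place. A non-mutating variant is sketched below on made-up data. The function name `commits_by_author` and the sample events are assumptions for illustration, not part of the original analysis.

```python
def commits_by_author(push_events, author_name):
    """Return (event, commit) pairs whose git author name matches `author_name`."""
    matches = []
    for event in push_events:
        for commit in event["payload"]["commits"]:
            if commit["author"]["name"] == author_name:
                matches.append((event, commit))
    return matches

# Hypothetical push events, trimmed to the fields used here
sample_push_events = [
    {"id": "1", "payload": {"commits": [
        {"sha": "aaa", "author": {"name": "Jane Doe"}},
        {"sha": "bbb", "author": {"name": "Someone Else"}},
    ]}},
    {"id": "2", "payload": {"commits": []}},
    {"id": "3", "payload": {"commits": [
        {"sha": "ccc", "author": {"name": "Jane Doe"}},
    ]}},
]

pairs = commits_by_author(sample_push_events, "Jane Doe")
print([commit["sha"] for _, commit in pairs])
```

Because the original events are left untouched, the same list can be filtered again for a different author.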

We can print out the name associated with the commits we have selected to confirm

In [13]:
commit_names = []
for event in user_commits:
    commit_list = event["payload"]["commits"]
    for com_n in range(len(commit_list)):
            commit_username = commit_list[com_n]["author"]["name"]
            commit_names.append(commit_username)
print("git usernames and number of commits:")
Counter(commit_names).most_common()

git usernames and number of commits:


[('Callum Rollo', 247)]

In [14]:
print(f"From {len(data)} events we have extracted {len(commit_names)} commits by {tgt_username}")

From 300 events we have extracted 247 commits by Callum Rollo


The final step of this (almost certainly imperfect) data cleaning is to get info on all the commits by this user. We will pull the author, message, SHA, url, repo and date into a pandas dataframe. 

In [15]:
rows = []
for event in user_commits:
    for commit in event["payload"]["commits"]:
        rows.append({"id": event["id"],
                     "datetime" : event["created_at"],
                     "sha" : commit["sha"],
                     "message" : commit["message"],
                     "author" : commit["author"]["name"],
                     "url": commit["url"],
                     "repo": event["repo"]["name"]})
# DataFrame.append was removed in pandas 2.0, so build the frame from a list of dicts.
# Sorting the columns alphabetically matches the layout shown below.
df = pd.DataFrame(rows).sort_index(axis=1)

We index by datetime and have a look at our dataframe

In [16]:
df.index = pd.DatetimeIndex(df.datetime)
df = df.drop("datetime", axis=1)
df

| datetime | author | id | message | repo | sha | url |
|---|---|---|---|---|---|---|
| 2020-09-05 17:21:13+00:00 | Callum Rollo | 13422558148 | Generate Pelican site | callumrollo/callumrollo.github.io | a9e3bb9213139f467fb5058847ff063e851b1ce6 | https://api.github.com/repos/callumrollo/callu... |
| 2020-09-05 17:20:55+00:00 | Callum Rollo | 13422557015 | updated categories | callumrollo/callumrollo.github.io | aca1fa72ad196df5351661a971b7de8add348fc6 | https://api.github.com/repos/callumrollo/callu... |
| 2020-09-05 16:52:01+00:00 | Callum Rollo | 13422458592 | Generate Pelican site | callumrollo/callumrollo.github.io | 79a6e0329968330e91bd9d202fac5257ce5c9ab2 | https://api.github.com/repos/callumrollo/callu... |
| 2020-09-05 16:51:33+00:00 | Callum Rollo | 13422457038 | link land rights articles | callumrollo/callumrollo.github.io | ff42b1412ff1437820c46655454821e267f3ed07 | https://api.github.com/repos/callumrollo/callu... |
| 2020-09-04 15:36:17+00:00 | Callum Rollo | 13413978610 | Now pointing at nbviewer, not raw file | callumrollo/quick_tidal_analysis | 02e127e2a70cf98008a22654992363abb8471c73 | https://api.github.com/repos/callumrollo/quick... |
| ... | ... | ... | ... | ... | ... | ... |
| 2020-06-08 10:52:40+00:00 | Callum Rollo | 12566251693 | typo | callumrollo/callumrollo.github.io | e712699f1d08f2294bd7e462bcb92768f6841e04 | https://api.github.com/repos/callumrollo/callu... |
| 2020-06-08 10:49:46+00:00 | Callum Rollo | 12566225573 | nice readme | callumrollo/callumrollo.github.io | 7f4e591fd2dd9fe043f240430ba3bf7bb0a4d062 | https://api.github.com/repos/callumrollo/callu... |
| 2020-06-07 18:09:06+00:00 | Callum Rollo | 12559178906 | fixed website url | callumrollo/callumrollo.github.io | a68bb4bc70b5e1090cd82ba113b2fcc989ecc97d | https://api.github.com/repos/callumrollo/callu... |
| 2020-06-07 18:06:28+00:00 | Callum Rollo | 12559164400 | Generate Pelican site | callumrollo/callumrollo.github.io | 9786f1f7fd493e2951ecf8df11f31ac8430d0e7f | https://api.github.com/repos/callumrollo/callu... |


Now we remove any repeated commits that may have snuck in by deduplicating on the SHA checksum.

**side note** the SHA checksum uniquely identifies each commit. Even if you had commits by the same author to the same repo with the same message ("added stuff" or something similarly helpful) the SHA will differentiate the two. See more [here](https://www.lifewire.com/what-is-sha-1-2626011)
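
To see why two otherwise identical commits get different SHAs, here is a small sketch of how git hashes an object: SHA-1 over a short header (`<type> <length>\0`) followed by the content. The two commit bodies are made up and heavily simplified (a real commit also records a tree, parents and a committer), and `git_object_sha1` is my own helper name.

```python
import hashlib

def git_object_sha1(obj_type, body):
    """Hash `body` (bytes) the way git does: SHA-1 over a header plus the content."""
    header = f"{obj_type} {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()

# Two made-up commit bodies: same author, same message, timestamps one second apart
commit_a = b"author Jane Doe <jane@example.com> 1599326473 +0000\n\nadded stuff\n"
commit_b = b"author Jane Doe <jane@example.com> 1599326474 +0000\n\nadded stuff\n"

print(git_object_sha1("commit", commit_a))
print(git_object_sha1("commit", commit_b))
```

The two digests differ because the timestamp is part of the hashed content, which is what makes the SHA a safe key for deduplication.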

In [17]:
df = df.drop_duplicates(subset=['sha'])
df

| datetime | author | id | message | repo | sha | url |
|---|---|---|---|---|---|---|
| 2020-09-05 17:21:13+00:00 | Callum Rollo | 13422558148 | Generate Pelican site | callumrollo/callumrollo.github.io | a9e3bb9213139f467fb5058847ff063e851b1ce6 | https://api.github.com/repos/callumrollo/callu... |
| 2020-09-05 17:20:55+00:00 | Callum Rollo | 13422557015 | updated categories | callumrollo/callumrollo.github.io | aca1fa72ad196df5351661a971b7de8add348fc6 | https://api.github.com/repos/callumrollo/callu... |
| 2020-09-05 16:52:01+00:00 | Callum Rollo | 13422458592 | Generate Pelican site | callumrollo/callumrollo.github.io | 79a6e0329968330e91bd9d202fac5257ce5c9ab2 | https://api.github.com/repos/callumrollo/callu... |
| 2020-09-05 16:51:33+00:00 | Callum Rollo | 13422457038 | link land rights articles | callumrollo/callumrollo.github.io | ff42b1412ff1437820c46655454821e267f3ed07 | https://api.github.com/repos/callumrollo/callu... |
| 2020-09-04 15:36:17+00:00 | Callum Rollo | 13413978610 | Now pointing at nbviewer, not raw file | callumrollo/quick_tidal_analysis | 02e127e2a70cf98008a22654992363abb8471c73 | https://api.github.com/repos/callumrollo/quick... |
| ... | ... | ... | ... | ... | ... | ... |
| 2020-06-08 10:52:40+00:00 | Callum Rollo | 12566251693 | typo | callumrollo/callumrollo.github.io | e712699f1d08f2294bd7e462bcb92768f6841e04 | https://api.github.com/repos/callumrollo/callu... |
| 2020-06-08 10:49:46+00:00 | Callum Rollo | 12566225573 | nice readme | callumrollo/callumrollo.github.io | 7f4e591fd2dd9fe043f240430ba3bf7bb0a4d062 | https://api.github.com/repos/callumrollo/callu... |
| 2020-06-07 18:09:06+00:00 | Callum Rollo | 12559178906 | fixed website url | callumrollo/callumrollo.github.io | a68bb4bc70b5e1090cd82ba113b2fcc989ecc97d | https://api.github.com/repos/callumrollo/callu... |
| 2020-06-07 18:06:28+00:00 | Callum Rollo | 12559164400 | Generate Pelican site | callumrollo/callumrollo.github.io | 9786f1f7fd493e2951ecf8df11f31ac8430d0e7f | https://api.github.com/repos/callumrollo/callu... |


------------------------
### Scaling it up: work from a list of Github usernames

Now that we have a method for finding commits by a user, the next step is to loop through a list of users. As a test case, I have analysed the commits from [Oceanhackweek2020](https://oceanhackweek.github.io/).

In [18]:
def gh_scrape(tgt_users, cred_file = 'credentials.json', verbose=True):
    """Simple scraping function
    Supply a list of github usernames ['jane-doe', 'torvalds', 'satoshi_nakamoto']
    Returns a pandas dataframe of unique commits by these users over the last 90 days
    Assumes that the git author name with the most commits in a user's push events belongs to that user
    Limited by the Github API to the 300 most recent events per user
    Requires you to supply a Github API token in a credentials.json file
    Verbose switch prints a line for each user with the number of events and commits found
    """
    # Get user supplied credentials for the github API
    credentials = json.loads(open(cred_file).read())
    username = credentials['username']
    token = credentials['token']
    df = pd.DataFrame()
    
    for tgt_user in tgt_users:
        base_url = 'https://api.github.com/users/' + tgt_user +'/events'
        url = base_url
        url_list = [base_url]
        data = []
        page_no = 1
        repos_data = []
        total_fetched = 0
        while (True):
            response = requests.get(url,auth = (username,token)).json()
            if isinstance(response, dict):
                # Catch when the API returns a dict rather than the expected list.
                # This is usually an error message about credentials or rate limits
                print(response)
                return
            data = data + response
            events_fetched = len(response)
            total_fetched = events_fetched + total_fetched
            if total_fetched == 300:
                # Requesting more will max out the API
                break
            if (events_fetched == 30):
                # if we fetched 30 events from this page, there may be another one after it
                page_no = page_no + 1
                url = base_url + '?page=' + str(page_no)
                url_list.append(url)
            else:
                # We have collected all events by this user
                break
        commits_events = []
        for event in data:
            # We're only interested in commits, which are classed as "PushEvents"
            if event["type"] == "PushEvent":
                commits_events.append(event)
        if len(commits_events)==0:
            # If the user has no commit events, stop processing
            continue
            
        commit_usernames_list = []
        for event in commits_events:
            # Search through the payload for which git user is associated with each commit
            commit_list = event["payload"]["commits"]
            if len(commit_list)>0:    
                for com_n in range(len(commit_list)):
                        commit_username = commit_list[com_n]["author"]["name"]       
                        commit_usernames_list.append(commit_username)
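        if not commit_usernames_list:
            # Push events with empty commit lists leave nothing to count
            continue
        # Working on the assumption that the git user with the most commits pushed to Github by this user is the one we want
        c = Counter(commit_usernames_list)
        most_common_username = c.most_common(1)[0][0]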
        user_commits = []
        
        # Go back through the commits and pull only the ones by the most common git username
        for event in commits_events:
            commit_list = event["payload"]["commits"]
            commit_list_author = []
            if len(commit_list)>0:    
                for com_n in range(len(commit_list)):
                    commit_username = commit_list[com_n]["author"]["name"]
                    if commit_username == most_common_username:
                        commit_list_author.append(commit_list[com_n])
                if commit_list_author:
                    event["payload"]["commits"] = commit_list_author
                    user_commits.append(event)
        
        # Extract the information we're interested in and put it in a pandas DataFrame
        for commit in user_commits:
            for com_n in range(len(commit["payload"]["commits"])):
                commit_detail = commit["payload"]["commits"][com_n]
                commit_subset = {"id": commit["id"],
                             "datetime" : commit["created_at"],
                             "sha" : commit_detail["sha"],
                             "message" : commit_detail["message"],
                             "author" : commit_detail["author"]["name"],
                             "url": commit_detail["url"],
                             "repo": commit["repo"]["name"]}
                # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
                df = pd.concat([df, pd.DataFrame([commit_subset])], ignore_index=True)
        df.index = pd.DatetimeIndex(df.datetime)
        df = df.drop_duplicates(subset=['sha'])
        if verbose:
            # len(user_commits) counts push events, each of which may hold several commits
            print(f"{tgt_user}: found {len(data)} events, of which {len(user_commits)} are pushes with commits by {most_common_username}\n")
    return df


In [19]:
users = ["callumrollo"]
df = gh_scrape(users, cred_file = 'credentials.json')
df

callumrollo: found 300 events, of which 214 are pushes with commits by Callum Rollo



| datetime | author | datetime | id | message | repo | sha | url |
|---|---|---|---|---|---|---|---|
| 2020-09-05 17:21:13+00:00 | Callum Rollo | 2020-09-05T17:21:13Z | 13422558148 | Generate Pelican site | callumrollo/callumrollo.github.io | a9e3bb9213139f467fb5058847ff063e851b1ce6 | https://api.github.com/repos/callumrollo/callu... |
| 2020-09-05 17:20:55+00:00 | Callum Rollo | 2020-09-05T17:20:55Z | 13422557015 | updated categories | callumrollo/callumrollo.github.io | aca1fa72ad196df5351661a971b7de8add348fc6 | https://api.github.com/repos/callumrollo/callu... |
| 2020-09-05 16:52:01+00:00 | Callum Rollo | 2020-09-05T16:52:01Z | 13422458592 | Generate Pelican site | callumrollo/callumrollo.github.io | 79a6e0329968330e91bd9d202fac5257ce5c9ab2 | https://api.github.com/repos/callumrollo/callu... |
| 2020-09-05 16:51:33+00:00 | Callum Rollo | 2020-09-05T16:51:33Z | 13422457038 | link land rights articles | callumrollo/callumrollo.github.io | ff42b1412ff1437820c46655454821e267f3ed07 | https://api.github.com/repos/callumrollo/callu... |
| 2020-09-04 15:36:17+00:00 | Callum Rollo | 2020-09-04T15:36:17Z | 13413978610 | Now pointing at nbviewer, not raw file | callumrollo/quick_tidal_analysis | 02e127e2a70cf98008a22654992363abb8471c73 | https://api.github.com/repos/callumrollo/quick... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2020-06-08 10:52:40+00:00 | Callum Rollo | 2020-06-08T10:52:40Z | 12566251693 | typo | callumrollo/callumrollo.github.io | e712699f1d08f2294bd7e462bcb92768f6841e04 | https://api.github.com/repos/callumrollo/callu... |
| 2020-06-08 10:49:46+00:00 | Callum Rollo | 2020-06-08T10:49:46Z | 12566225573 | nice readme | callumrollo/callumrollo.github.io | 7f4e591fd2dd9fe043f240430ba3bf7bb0a4d062 | https://api.github.com/repos/callumrollo/callu... |
| 2020-06-07 18:09:06+00:00 | Callum Rollo | 2020-06-07T18:09:06Z | 12559178906 | fixed website url | callumrollo/callumrollo.github.io | a68bb4bc70b5e1090cd82ba113b2fcc989ecc97d | https://api.github.com/repos/callumrollo/callu... |
| 2020-06-07 18:06:28+00:00 | Callum Rollo | 2020-06-07T18:06:28Z | 12559164400 | Generate Pelican site | callumrollo/callumrollo.github.io | 9786f1f7fd493e2951ecf8df11f31ac8430d0e7f | https://api.github.com/repos/callumrollo/callu... |


Quite a few commits. What if we want solely the hackweek ones?

In [20]:
df['ohw20_repo'] = df['repo'].str.contains("ohw20")
sum(df['ohw20_repo'] )

19

### OHW analysis

Using the above function and a list of hackweek participants (not included) I grab github commits from the last 90 days

In [21]:
import csv
from itertools import chain
with open('ohw_participants.csv', newline='') as f:
    nest_list = list(csv.reader(f))
ohw_participants_list = list(chain.from_iterable(nest_list))
df_all = gh_scrape(ohw_participants_list, cred_file='credentials.json', verbose=False)

print(f"total commits: {len(df_all)}")

total commits: 1512


That's a lot of commits! 

As you saw when I grabbed the data for just my own username, it contains a lot of identifying information. Time for some anonymising.

In [22]:
df_min = df_all[['author', 'message']].copy()
df_min['ohw20_repo'] = df_all['repo'].str.contains("ohw20")

In [23]:
uniq_names = np.unique(df_min.author.values)

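# Map each author name to an anonymous label, touching only the author column
name_map = {uniq_name: 'participant-' + str(n) for n, uniq_name in enumerate(uniq_names)}
df_min['author'] = df_min['author'].map(name_map)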
    

In [24]:
df_min

| datetime | author | message | ohw20_repo |
|---|---|---|---|
| 2020-08-20 15:08:09+00:00 | participant-21 | Merge pull request #1 from cbirdferrer/patch-1... | False |
| 2020-08-17 16:24:17+00:00 | participant-21 | deleting old presentation_figs file | True |
| 2020-08-14 20:00:38+00:00 | participant-21 | adding improvements to interpolate notebook to... | True |
| 2020-08-14 20:00:38+00:00 | participant-21 | merging changes | True |
| 2020-08-14 20:00:38+00:00 | participant-21 | merging interpolate notebook | True |
| ... | ... | ... | ... |
| 2020-08-07 15:58:21+00:00 | participant-33 | change read.md | False |
| 2020-08-06 22:26:24+00:00 | participant-33 | change new2.md and create new.md | False |
| 2020-08-06 22:25:02+00:00 | participant-33 | changes in new.md and new2.md | False |
| 2020-08-06 22:20:59+00:00 | participant-33 | edit readme and new, create new2.md | False |
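
The author column is now anonymised, but free-text commit messages can still carry usernames, as merge commits of the form `Merge pull request #N from user/branch` do. A rough check is sketched below. The function name `flag_identifying_messages`, the usernames and the messages are all made up for illustration, and a plain substring match like this will miss real names and nicknames.

```python
import re

def flag_identifying_messages(messages, usernames):
    """Return the messages that mention any of `usernames` (case-insensitive)."""
    if not usernames:
        return []
    pattern = re.compile("|".join(re.escape(name) for name in usernames), re.IGNORECASE)
    return [msg for msg in messages if pattern.search(msg)]

# Hypothetical usernames and messages, for illustration only
usernames = ["jane-doe", "torvalds"]
messages = [
    "Merge pull request #1 from jane-doe/patch-1",
    "deleting old presentation_figs file",
    "merging changes",
]

flagged = flag_identifying_messages(messages, usernames)
print(flagged)
```

On the real data, `messages` could be `df_min['message']` and `usernames` the participant list, with flagged rows reviewed or redacted before sharing.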


In [25]:
df_min.to_csv('ohw_anonymised.csv')

### What to do with this data? 
What cool analysis could we do? Grab the dataset `ohw_anonymised.csv` and give it a go.

I've started some analysis in the companion notebook `ohw_analysis.ipynb` in this directory

-----------------------------

# Going deeper

We can get all the details of a commit by delving deeper into the json structure accessed through the commit url.

**example:**

In [26]:
event = user_commits[-1]  # the oldest push event collected for my account earlier
url = event['payload']['commits'][0]['url']
commit_detail = requests.get(url,auth = (username,token)).json()

The most interesting section is `files`. This gives a summary of the lines changed on each file altered in this commit

In [27]:
commit_detail['files']

[{'sha': '7da037c3bef3e88f51aae44f76be53e8b05764a1',
  'filename': '.gitignore',
  'status': 'added',
  'additions': 2,
  'deletions': 0,
  'changes': 2,
  'blob_url': 'https://github.com/callumrollo/callumrollo.github.io/blob/eb3e5254d8f5cbb7b32d85385809293c4740dc28/.gitignore',
  'raw_url': 'https://github.com/callumrollo/callumrollo.github.io/raw/eb3e5254d8f5cbb7b32d85385809293c4740dc28/.gitignore',
  'contents_url': 'https://api.github.com/repos/callumrollo/callumrollo.github.io/contents/.gitignore?ref=eb3e5254d8f5cbb7b32d85385809293c4740dc28',
  'patch': '@@ -0,0 +1,2 @@\n+__pycache__\n+output'},
 {'sha': '1ab4331aab48efb22f1a9edad26d271105282a9a',
  'filename': '404.html',
  'status': 'removed',
  'additions': 0,
  'deletions': 138,
  'changes': 138,
  'blob_url': 'https://github.com/callumrollo/callumrollo.github.io/blob/b5f1b02d89e51f3ee07660303f048f1ec7b294f8/404.html',
  'raw_url': 'https://github.com/callumrollo/callumrollo.github.io/raw/b5f1b02d89e51f3ee07660303f048f1ec7b29
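
As a first step towards using this, the per-file numbers can be totalled for the whole commit. The sketch below uses the first two entries of the output above, trimmed to the fields it needs. `summarise_files` is my own helper name, not part of the API.

```python
def summarise_files(files):
    """Total the lines added and deleted across the files of one commit."""
    additions = sum(f["additions"] for f in files)
    deletions = sum(f["deletions"] for f in files)
    return {"files": len(files), "additions": additions, "deletions": deletions}

# First two entries of the `files` output, trimmed to the fields used here
sample_files = [
    {"filename": ".gitignore", "status": "added", "additions": 2, "deletions": 0},
    {"filename": "404.html", "status": "removed", "additions": 0, "deletions": 138},
]

summary = summarise_files(sample_files)
print(summary)
```

On a real response you would call `summarise_files(commit_detail['files'])`. Bear in mind that fetching the detail of every commit costs one API request each, which eats into the hourly rate limit.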

### Ideas for further analysis
- Use of different filetypes. Particularly .py vs .ipynb
- word cloud of commit messages
- check out non-commit activity: merge, PR, issue...
- geographical/timezone patterns
- how much "crunch" did we get before presentations on Friday?
- examine links between authors (who merged who? Comments mentioning issues?)
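
As a starting point for the "crunch" question, commits can be counted per weekday straight from the `created_at` strings. This is a minimal sketch on made-up timestamps. The helper name `commits_per_weekday` is an assumption, and the timestamps are treated as UTC, so a participant's local Friday may spill into other days.

```python
from collections import Counter
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

def commits_per_weekday(timestamps):
    """Count commits per weekday from timestamps like '2020-08-14T20:00:38Z'."""
    days = []
    for ts in timestamps:
        parsed = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")
        days.append(DAY_NAMES[parsed.weekday()])  # weekday() gives Monday as 0
    return Counter(days)

# Hypothetical timestamps in the format of the API's `created_at` field
sample_timestamps = [
    "2020-08-13T09:15:00Z",
    "2020-08-13T23:40:00Z",
    "2020-08-14T01:05:00Z",
    "2020-08-14T14:30:00Z",
    "2020-08-14T15:55:00Z",
]

weekday_counts = commits_per_weekday(sample_timestamps)
print(weekday_counts.most_common())
```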

### Github is a rich mine of information
I hope this notebook serves as a good reminder that everything we put on it is public and scrapable, so write helpful commit messages!