# GitHub Get Repository Data

This **Notebook-as-Tool** allows you to:
1. query multiple types of data (commits, contributors, issues, etc.) for repositories
2. retrieve a list of all repositories of a user or organization

**How to Use:**
1. To run or adapt this Colab notebook, create a copy in your Google Drive: **File → Save a copy in Drive**. It will be stored in a folder named ```Colab Notebooks```. Open this file with Google Colab and run the cells consecutively by pressing the **Play** button or **Shift+Enter**.
2. Using the GitHub API requires authentication. You need to sign up for a free account and create a [personal API access token](https://docs.github.com/en/github/authenticating-to-github/creating-a-personal-access-token).
3. Since API access tokens are sensitive data, they are stored in a config file on Google Drive:
  - Download the GitHub config template (http://tiny.cc/github-config-template).
  - Add your information using your preferred code editor, such as [SublimeText](https://www.sublimetext.com), [Atom](https://atom.io/), or [Brackets](http://brackets.io/).
  - Upload it to your Google Drive. The notebook assumes that the file is named ```github_config.ini``` and placed in a folder named ```Colab_Data/Configs```.
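
The template's contents are not reproduced here, but the setup code of this notebook reads a section named `Github` with a key `api_token`. A minimal sketch of such a file and of how `configparser` parses it, assuming that layout and using a placeholder token:

```python
import configparser

# Hypothetical file contents: the section and key names match what the
# setup cell looks up, the token value is only a placeholder.
SAMPLE_CONFIG = """
[Github]
api_token = ghp_your_token_here
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE_CONFIG)

# Same lookup the notebook performs after reading github_config.ini
api_token = config["Github"]["api_token"]
print("Token loaded:", bool(api_token))
```

If the section or key is missing, the lookup raises a `KeyError`, which is the most likely cause of a config error during setup.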

**Important notes:**
- Code is hidden in the background of Colab forms. To view and edit the code, **double-click** the cell or select **View → Show/hide code**.
- Data will be stored in Google Drive in the folder ```Colab_Data```. A connection to your Drive will be authenticated when running the setup code cells. This is temporary, and only your current notebook will be connected to your Drive. The connection will be revoked when the notebook is terminated or by selecting **Runtime → Factory reset runtime**.
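
To make the storage location concrete, the sketch below rebuilds the output path the way the tool's save step does: one subfolder per repository below `Colab_Data/Data/Github`, with CSV files named after the data type and JSON files additionally carrying the retrieval date. The helper `expected_output_path` is hypothetical and only illustrates the naming scheme.

```python
import os
from datetime import date

# Base folder used by the notebook, relative to the Colab working directory
DATA_PATH = os.path.join("gdrive", "MyDrive", "Colab_Data", "Data", "Github")

def expected_output_path(repo_name, outfile_name, save_as="csv", day=None):
    """Hypothetical helper: return the path a saved file is expected to have."""
    folder = os.path.join(DATA_PATH, repo_name)
    if save_as == "json":
        # JSON files get the retrieval date appended to their name
        day = day or date.today()
        return os.path.join(folder, "{}_{}.json".format(outfile_name, day))
    return os.path.join(folder, outfile_name + ".csv")

print(expected_output_path("octocat/Hello-World", "commits"))
print(expected_output_path("octocat/Hello-World", "issues", "json", date(2024, 1, 31)))
```

The CSV files are written tab-separated despite their `.csv` extension, so pass `sep='\t'` when loading them with `pd.read_csv`.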


**Credits:** 

This notebook was written by Marcus Burkhardt. It uses the PyGithub library (https://pypi.org/project/PyGithub/), which provides higher-level functionality for interacting with the GitHub API. The PyGithub documentation can be found here: https://pygithub.readthedocs.io/
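
As a rough illustration of what PyGithub offers, the sketch below collects the SHA hashes of the first few commits the API returns for a repository. The function `latest_commit_shas` and its `limit` parameter are hypothetical and not part of this notebook; `api` is assumed to be an authenticated `Github` object like the one created in Setup 4.

```python
def latest_commit_shas(api, repo_name, limit=5):
    """Hypothetical helper: return up to `limit` commit SHAs of a repository."""
    repo = api.get_repo(repo_name)       # e.g. "user-name/repository-name"
    shas = []
    for commit in repo.get_commits():    # PyGithub fetches further pages lazily
        if len(shas) >= limit:
            break
        shas.append(commit.sha)
    return shas

# Usage in the notebook, once Setup 4 has run (requires network access):
# latest_commit_shas(api, "PyGithub/PyGithub", limit=3)
```

Stopping the loop early avoids requesting pages that are not needed, which also saves API calls against the rate limit.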

In [None]:
#@title Setup 1: Mount Google Drive for Loading and Storing Data
from google.colab import drive
drive.mount('/content/gdrive')

In [None]:
#@title Setup 2: Install and Load Required Libraries and Run Setup Procedures

try:
  from github import Github
except ImportError:
  # Install PyGithub if it is not available yet, then retry the import
  !pip install PyGithub
  from github import Github

import os
import json
import time
import configparser 
import pandas as pd
from datetime import datetime
from tqdm.notebook import tqdm
print('Successfully installed and loaded libraries')

# Defining path variable for config path
config_path = os.path.join("gdrive", "MyDrive", "Colab_Data", "Configs")
if not os.path.isdir(config_path):
  os.makedirs(config_path)

# Defining path variable for data path
data_path = os.path.join("gdrive", "MyDrive", "Colab_Data", "Data", "Github")
if not os.path.isdir(data_path):
  os.makedirs(data_path)

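try:
  # Reading config and setting configuration values
  config = configparser.ConfigParser()
  config.read(os.path.join(config_path, "github_config.ini"))
  api_token = str(config['Github']['api_token'])

  # Check if API credentials were successfully parsed
  if api_token:
    print('Successfully parsed config data.')
  else:
    print('The api_token entry in the config is empty.')
except (KeyError, configparser.Error):
  # A missing file, section or key surfaces as KeyError, a malformed file as configparser.Error
  print('Error reading or parsing the config.')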

In [None]:
#@title Setup 3: Definition of Core and Support Functions Used by the Tool(s)

def check_rate_limit(api):
    """Sleep until the rate limit resets when fewer than 10 API calls remain."""
    if api.rate_limiting[0] < 10:
        # Wait until the reset time plus a 10 second safety margin
        wait = api.rate_limiting_resettime - time.time() + 10
        if wait > 0:
            print('Sleeping until {} UTC for {:.0f} seconds'.format(datetime.fromtimestamp(api.rate_limiting_resettime), wait))
            time.sleep(wait)
    return True

def get_repo_data(api, repo_name, get='commits', save=True, save_as='csv'):
    check_rate_limit(api)
    repo = api.get_repo(repo_name)
    
    if get == 'branches':
      api_call = repo.get_branches()
      outfile_name = 'branches'
    elif get == 'commits':
      api_call = repo.get_commits()
      outfile_name = 'commits'
    elif get == 'comments':
      api_call = repo.get_comments()
      outfile_name = 'comments'
    elif get == 'contributors':
      api_call = repo.get_contributors()
      outfile_name = 'contributors'
    elif get == 'collaborators':
      api_call = repo.get_collaborators()
      outfile_name = 'collaborators'
    elif get == 'forks':
      api_call = repo.get_forks()
      outfile_name = 'forks'
    elif get == 'issues':
      api_call = repo.get_issues()
      outfile_name = 'issues'
    elif get == 'labels':
      api_call = repo.get_labels()
      outfile_name = 'labels'
    elif get == 'languages':
      api_call = repo.get_languages()
      outfile_name = 'languages'
    elif get == 'pulls':
      api_call = repo.get_pulls()
      outfile_name = 'pulls'
    elif get == 'releases':
      api_call = repo.get_releases()
      outfile_name = 'releases'
    elif get == 'code_frequency':
      api_call = repo.get_stats_code_frequency()
      outfile_name = 'code_frequency'
    elif get == 'commit_activity':
      api_call = repo.get_stats_commit_activity()
      outfile_name = 'stats_commit_activity'
    elif get == 'contributor_stats':
      api_call = repo.get_stats_contributors()
      outfile_name = 'stats_contributors'
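    elif get in ('top_referrers', 'top_refferers'):
      # The misspelled key is still accepted so existing calls keep working
      api_call = repo.get_top_referrers()
      outfile_name = 'top_referrers'
    else:
      # Fail early with a clear message instead of a NameError further down
      raise ValueError('Unknown data type requested: {}'.format(get))
    tmp = []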

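    if api_call is None:
      # Some calls can return None, for example statistics GitHub has not computed yet
      print('No data returned for {} {}. Try again later.'.format(repo_name, get))
    elif isinstance(api_call, dict):
      # Languages are returned as a single dictionary
      tmp.append(api_call)
    elif isinstance(api_call, list):
      for d in tqdm(api_call, desc="Get " + repo_name + " " + get):
        check_rate_limit(api)
        tmp.append(d.raw_data)
    else:
      # Paginated result: iterate directly, totalCount only drives the progress bar
      for d in tqdm(api_call, total=api_call.totalCount, desc="Get " + repo_name + " " + get):
        check_rate_limit(api)
        tmp.append(d.raw_data)
    data = tmp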
    
    # Each repository gets its own subfolder below the data path
    outpath = os.path.join(data_path, repo_name)
    os.makedirs(outpath, exist_ok=True)

    if save:
      if save_as == "json":
        # JSON files carry the retrieval date in their name
        outfile_path = os.path.join(outpath, outfile_name + '_' + str(datetime.now().date()) + '.json')
        with open(outfile_path, 'w') as outfile:
          json.dump(data, outfile, indent=4)
      else:
        if save_as != "csv":
          print('Wrong output type set. Falling back to csv.')
        # Flatten nested JSON into columns; the file is tab-separated despite its .csv extension
        data = pd.json_normalize(data)
        data.to_csv(os.path.join(outpath, outfile_name+'.csv'), sep='\t', index=False)
    return data

def get_repos(api, repos, get="commits", save=True, save_as='csv'):
  print('Results will be stored in: {}'.format('/'.join(data_path.split('/')[2:])))
  if isinstance(repos, str):
    # Split the comma-separated string and drop empty entries
    repos = [repo.strip() for repo in repos.split(",") if repo.strip()]
  data = []
  for repo in tqdm(repos, desc="Overall progress"):
    print('{} API calls currently available. Limit will be reset on {} UTC'.format(api.rate_limiting[0], datetime.fromtimestamp(api.rate_limiting_resettime)))
    data.append([repo, get_repo_data(api, repo, get, save, save_as)])
  return data

In [None]:
#@title Setup 4: Initialize Github API connection
api = Github(api_token)
print('{} API calls currently available. Limit will be reset on {} UTC.'.format(api.rate_limiting[0], datetime.fromtimestamp(api.rate_limiting_resettime)))

In [None]:
#@title Tool 1: Query Data from Repositories with ```user-name/repository-name``` (comma separate multiple repositories)
repos = "" #@param {type:"string"}
get = "commits" #@param ["branches", "commits", "comments", "contributors", "forks", "issues", "labels", "languages", "pulls", "releases", "code_frequency", "commit_activity", "contributor_stats"]
save = True 
save_as = "json" #@param ["csv", "json"]

# run the query
api = Github(api_token)
data = get_repos(api, repos, get=get, save=save, save_as=save_as)

In [None]:
#@title Tool 2: Retrieve a List of Repositories of a User/Organization
name = "" #@param {type:"string"}
api = Github(api_token)
user = api.get_user(name)
repos = user.get_repos()
repos = [item.full_name for item in list(repos)]

print("Copy the result below and paste it into the query tool:")
print()
', '.join(repos)