# Scrape repo stats using PyGithub

### Author: Crystal Zang

This notebook uses GitHub personal access tokens (PATs) to scrape repository statistics such as stargazers, watchers, forks, and topics. One PAT scrapes at a rate of 5,000 repositories per hour. Using 36 PATs, 10,288,063 repositories can be scraped in about 663 hours, at a rate of 15,514 repositories per hour. Note that if a repository no longer exists (for example, because it was deleted), it is not saved to the final dataset. We did not use multiprocessing.
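
As a quick sanity check of the figures above, the sketch below recomputes the run time from the quoted totals. The "ceiling" is an assumption introduced here for comparison only: it supposes that all 36 tokens could each sustain the single-token rate at the same time.

```python
# Figures quoted in the paragraph above.
total_repos = 10_288_063
observed_rate = 15_514        # repositories per hour, all tokens combined

hours = total_repos / observed_rate
print(f"{hours:.0f} hours, or about {hours / 24:.1f} days")

# Assumed upper bound: every token sustains the single-token rate at once.
num_tokens = 36
per_token_rate = 5_000        # repositories per hour per token
ceiling = num_tokens * per_token_rate
print(f"observed rate is {observed_rate / ceiling:.1%} of the {ceiling:,} per hour ceiling")
```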

#### Warnings
Do not commit any access token to GitHub; doing so results in the token being revoked.
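
One way to keep tokens out of version control is to read them from the environment at run time. The sketch below is illustrative only: the variable name `GITHUB_PAT` and the helper `read_token` are hypothetical and do not appear elsewhere in this notebook.

```python
import os

def read_token(var_name="GITHUB_PAT"):
    """Return the token stored in an environment variable, or fail loudly."""
    token = os.environ.get(var_name)
    if not token:
        raise RuntimeError(
            f"Set {var_name} in the environment or in an untracked .env file."
        )
    return token

# Usage (not run here): g = Github(read_token())
```

If the token lives in a `.env` file, calling `load_dotenv()` (imported below) before `read_token()` loads it into the environment; the `.env` file itself should be listed in `.gitignore`.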


In [1]:
%reset

Once deleted, variables cannot be recovered. Proceed (y/[n])?  y


In [2]:
# load packages 
import os
import psycopg2 as pg
from sqlalchemy import create_engine
import pandas as pd
import requests as r
import string 
import json
import base64
import urllib.request
import itertools 
import numpy as np
from bs4 import BeautifulSoup
from dotenv import load_dotenv
from github import Github, RateLimitExceededException, BadCredentialsException, BadAttributeException, GithubException, UnknownObjectException, BadUserAgentException
import warnings
import datetime
import time  # needed for time.sleep() in the scraping function

import multiprocessing
#from multiprocessing.pool import ThreadPool as Pool
from multiprocessing import Pool, freeze_support

import concurrent.futures

warnings.simplefilter(action='ignore', category=FutureWarning)

### Get the repo slugs that we plan to scrape

In [3]:
#os.environ['db_user'] = ''
#os.environ['db_pwd'] = ''

# connect to the database and download the repos that have between 700 and 800 commits
connection = pg.connect(host = 'postgis1', database = 'sdad', 
                        user = os.environ.get('db_user'), 
                        password = os.environ.get('db_pwd'))

raw_slug_data = '''SELECT * FROM gh_2007_2020.repos_ranked WHERE (commits BETWEEN '700' AND '800')'''
#raw_slug_data = '''SELECT * FROM gh_2007_2020.repos_ranked WHERE commits < 1000'''

# convert to a dataframe, show how many missing we have (none)
raw_slug_data = pd.read_sql_query(raw_slug_data, con=connection)

connection.close()


print(raw_slug_data.head())
print(raw_slug_data.shape)
print(raw_slug_data.isna().sum())

                                 id      spdx                     slug  \
0  MDEwOlJlcG9zaXRvcnkzMDgyMjMzNw==       MIT      WenchaoD/FSCalendar   
1  MDEwOlJlcG9zaXRvcnk5Mjc0MDI1Ng==   GPL-3.0  halvors/Nuclear-Physics   
2  MDEwOlJlcG9zaXRvcnkyODUyODU5ODU=       MIT         LCYforever/lqt5_   
3  MDEwOlJlcG9zaXRvcnkxNjg5MzQ5MDA=  AGPL-3.0  AragonBlack/fundraising   
4  MDEwOlJlcG9zaXRvcnk3MjI2MzgwMQ==       MIT           guyellis/learn   

            createdat                                        description  \
0 2015-02-15 08:43:09  A fully customizable iOS calendar library, com...   
1 2017-05-29 12:59:30  Nuclear Physics is a mod that brings in realis...   
2 2020-08-05 12:47:19                                               None   
3 2019-02-03 10:47:17    Fundraising apps suite for Aragon organizations   
4 2016-10-29 03:56:49               Math exercises for kids aged 5 to 12   

  primarylanguage                                        branch  commits  \
0     Objective-C  MDM

In [4]:
#raw_slug_data = pd.read_csv('/home/zz3hs/git/dspg21oss/data/dspg21oss/crystal_to_scrape_0712.csv') #import csv

In [5]:
# strip leading and trailing whitespace and save the slugs to a list
slugs = [s.strip() for s in raw_slug_data["slug"].tolist()]
print(len(slugs))
print(slugs[0], slugs[-1])

16734
WenchaoD/FSCalendar phil-el/phetools


### Get Access Tokens

In [8]:
#os.environ['db_user'] = ''
#os.environ['db_pwd'] = ''

# connect to the database and download the access tokens
connection = pg.connect(host = 'postgis1', database = 'sdad', 
                        user = os.environ.get('db_user'), 
                        password = os.environ.get('db_pwd'))

#PATs access token, saved as a dataframe
github_pats = '''SELECT * FROM gh_2007_2020.pats_update'''
github_pats = pd.read_sql_query(github_pats, con=connection)

#PATs access token, saved as a pandas Series
access_tokens = github_pats["token"]

#number of tokens available for use, a numeric value
num_token = '''SELECT COUNT(*) FROM gh_2007_2020.pats_update'''
num_token = pd.read_sql_query(num_token, con=connection)
num_token=num_token.iloc[0]['count']

connection.close()

In [9]:
# index ranges from 0 to maximum number of PATs available
def get_access_token(github_pat_index):
    if github_pat_index < num_token:
       # print("Extracting access token #", github_pat_index+1,", total", num_token, "tokens are available.")
        return github_pats.token[github_pat_index]
    else:
        print("Token index exceeds the number of available tokens")

In [10]:
len(access_tokens)

34
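
The scraper below moves through the tokens by index and wraps back to the start when it runs out. A minimal sketch of that round-robin idea, using `itertools.cycle`, is shown here. The helper `token_cycle` and the placeholder token strings are hypothetical and are not used by the notebook itself.

```python
import itertools

def token_cycle(tokens, start=0):
    """Return an endless iterator of (index, token) pairs, beginning at `start`."""
    n = len(tokens)
    if n == 0:
        raise ValueError("no access tokens available")
    indexed = list(enumerate(tokens))
    start %= n
    # Rotate the list so iteration begins at `start`, then repeat forever.
    return itertools.cycle(indexed[start:] + indexed[:start])

# Placeholder strings stand in for real tokens.
rotation = token_cycle(["tok_a", "tok_b", "tok_c"], start=2)
first_four = [next(rotation) for _ in range(4)]
print(first_four)
```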

### Scraping function (no multiprocessing)

In [11]:
def pull_repo_stats(github_pat_index, slugs):
    df_repo_stats = pd.DataFrame()
    for slug in slugs:
        if github_pat_index >= len(access_tokens):
            github_pat_index -= len(access_tokens)
            print("***PAT index exceeds the number of tokens, restarting the token loop at #", github_pat_index)
        while github_pat_index < len(access_tokens):
            try:
                access_token = get_access_token(github_pat_index)
                #print("Scraping --", slug,". Extracting access token #", github_pat_index+1,", total", num_token, "tokens are available.")
                # PyGithub retries a failed request up to 20 times, with a 15-second timeout per request
                g = Github(access_token, retry = 20, timeout = 15)
                repo = g.get_repo(slug)
                df_repo_stats = df_repo_stats.append({
                    "slug": slug,
                    'stars': repo.stargazers_count,
                    'watchers': repo.subscribers_count,
                    'forks': repo.forks_count,
                    'topics': repo.get_topics()
                }, ignore_index = True)
            except RateLimitExceededException as e:
                print(e.status)
                print('Rate limit exceeded --', slug, ", using access token #", github_pat_index)
                print("Current time:", datetime.datetime.now())
                #time.sleep(300)
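                # rotate to the next token (wrapping around) and retry the same slug
                github_pat_index = (github_pat_index + 1) % len(access_tokens)
                print("***Exit current access token, proceed with next access token #", github_pat_index, "rescrape --",slug)
                continue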
            except BadCredentialsException as e:
                print(e.status)
                print('Bad credentials exception --', slug, ", using access token #", github_pat_index)
                print("Current time:", datetime.datetime.now())
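                # rotate to the next token (wrapping around) and retry the same slug
                github_pat_index = (github_pat_index + 1) % len(access_tokens)
                print("***Exit current access token, proceed with next access token #", github_pat_index, "rescrape --",slug)
                continue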
            except UnknownObjectException as e:
                print(e.status)
                print('Unknown object exception --', slug)
                break
            except GithubException as e:
                print(e.status)
                print('General exception --', slug)
                break
            except r.exceptions.ConnectionError as e:
                print('Retries limit exceeded --', slug)
                print(str(e))
                time.sleep(10)
                continue
            except r.exceptions.Timeout as e:
                print('Time out exception --', slug)
                print(str(e))
                time.sleep(10)
                continue
            break
    return df_repo_stats
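
The control flow of the function above (retry the same slug with the next token on a rate limit, skip slugs that no longer exist) can be exercised without network access. The sketch below is not the notebook's scraper: the exception classes are stand-ins for PyGithub's `RateLimitExceededException` and `UnknownObjectException`, and `scrape_with_rotation` and `fake_fetch` are hypothetical names. It also caps the attempts per slug at one per token, which is an added assumption.

```python
class RateLimited(Exception):
    """Stand-in for github.RateLimitExceededException."""

class NotFound(Exception):
    """Stand-in for github.UnknownObjectException."""

def scrape_with_rotation(slugs, tokens, fetch, start=0):
    """Call fetch(token, slug) for each slug, switching tokens on rate limits.

    Each slug is tried with every token at most once. Slugs that do not
    exist, or that exhaust all tokens, are returned in the skipped list.
    """
    rows, skipped = [], []
    index = start % len(tokens)
    for slug in slugs:
        for _ in range(len(tokens)):
            try:
                rows.append(fetch(tokens[index], slug))
                break
            except RateLimited:
                index = (index + 1) % len(tokens)  # same slug, next token
            except NotFound:
                skipped.append(slug)
                break
        else:
            skipped.append(slug)  # every token was rate limited
    return rows, skipped

# Fake fetcher: token "a" is exhausted and "ghost/repo" has been deleted.
def fake_fetch(token, slug):
    if token == "a":
        raise RateLimited()
    if slug == "ghost/repo":
        raise NotFound()
    return {"slug": slug, "token": token}

rows, skipped = scrape_with_rotation(
    ["x/one", "ghost/repo", "y/two"], ["a", "b"], fake_fetch
)
print(rows)
print(skipped)
```

Collecting the rows in a plain list and building the DataFrame once at the end, with `pd.DataFrame(rows)`, also avoids `DataFrame.append`, which was removed in pandas 2.0.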



In [12]:
len(slugs)

16734

In [None]:
start_time = datetime.datetime.now()
print("Start scraping:", start_time)
df_repo_stats = pull_repo_stats(0, slugs) #specify the index of pat you want use to start scraping
end_time =  datetime.datetime.now()
print("Finished scraping", len(df_repo_stats), "of", len(slugs), "records at", end_time)
print("It took", end_time-start_time, "to run.")
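
For a run that lasts many hours, one option is to split the slug list into batches so that results can be written out between batches. The sketch below shows only the splitting step. The helper `batches` and the example slugs are hypothetical, and the batch size is arbitrary.

```python
def batches(items, size):
    """Yield consecutive slices of `items`, each holding at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]

example_slugs = [f"owner/repo{i}" for i in range(7)]
for number, batch in enumerate(batches(example_slugs, 3)):
    print(number, batch)
```

Each batch could then be passed to `pull_repo_stats` and saved before the next one starts, so an interruption loses at most one batch of work.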

In [11]:
# check the first few rows of the scraped results
df_repo_stats.head()
#print(df_repo_stats)
#print(df_repo_stats.isna().sum())
#print(df_repo_stats.shape)

Unnamed: 0,forks,slug,stars,topics,watchers
0,1.0,petejohanson/hyena,1.0,[],1.0
1,0.0,stuartwan/imlazyone.github.io,0.0,[],1.0
2,311.0,keycloak/keycloak-nodejs-connect,402.0,[],26.0
3,155.0,TommyLemon/APIAuto,910.0,"[apijson, vuejs2, document-database, autotesti...",26.0
4,245.0,seg/2016-ml-contest,144.0,"[machine-learning, data-science, geoscience, c...",30.0


In [12]:
# save csv
#df_repo_stats.to_csv(r'/home/zz3hs/git/dspg21oss/data/dspg21oss/new_repo_stats_0712_3.csv', index = False)   
