# Searching for Jupyter Notebooks on GitHub

This is the first notebook in a series designed to analyze over 1 million notebooks hosted publicly on GitHub.

This particular notebook documents the first step of that process: searching for notebooks on GitHub using GitHub's search API. See [GitHub's API documentation](https://developer.github.com/v3/) and especially their [getting started guide](https://developer.github.com/v3/guides/getting-started/) for an introduction to using their search API.

This search began at 10:00 a.m. on Tuesday, July 11, 2017 and ended at 3:40 p.m. on Thursday, July 13, 2017. It spanned multiple days for two reasons: GitHub's search API is limited to 30 queries per minute, and GitHub's abuse rate limiting reduced our effective search rate to around 10-15 queries per minute. Searching more frequently than that often prompted GitHub to demand that we wait 60 seconds before running our next search.

In plain language, the query we used to search for notebooks asked for 1) files with "ipynb" in their path, 2) that ended with the ".ipynb" extension and 3) were written in the language "Jupyter Notebook". Since this query returns about 1.25 million results, and GitHub will only give access to 1,000 results at a time, the overall query had to be split into over a thousand subqueries. We decided to split the query by looking for files of different ranges of sizes (e.g. 0-10 bytes, 10-20 bytes, and so on).
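
As a rough back-of-the-envelope check on why the search took days, the figures quoted above can be combined as follows. This is only a sketch: it assumes every subquery returns a full 1,000 results in pages of 100, which the real size-based splits will not do, so the numbers are best read as lower bounds.

```python
import math

total_results = 1_250_000   # approximate size of the overall result set
max_per_query = 1_000       # most results GitHub exposes for a single query
per_page = 100              # page size requested in this notebook

# lower bound on the number of size-range subqueries
min_subqueries = math.ceil(total_results / max_per_query)

# lower bound on the number of HTTP requests (one per page of results)
min_requests = math.ceil(total_results / per_page)

# hours needed at the effective rates of 15 and 10 queries per minute
hours_at_15 = min_requests / 15 / 60
hours_at_10 = min_requests / 10 / 60

print(min_subqueries, min_requests)
print(round(hours_at_15, 1), round(hours_at_10, 1))
```

Queries that come back with more than 1,000 results and have to be re-run on a narrower range add further requests on top of this estimate.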

More detailed information about exactly which searches were performed, and when, can be found in `logs/nb_metadata_query_log.txt`.

In [None]:
import os
import time
import json
import requests

import pandas as pd

## Search with Recursion (0 - 6954 bytes)
This first implementation of the search algorithm used recursion, which Python limits to about 1,000 levels deep. Since I had to run many more than 1,000 search queries, I opted for a while loop instead (see below). I am keeping this code here for documentation's sake, but most of our API queries were run using the while-loop code below.

In [None]:
# http request authentication
header = {'Authorization': 'token %s' % os.environ['GITHUB_TOKEN']}

# initialize query request parameters
base_url = 'https://api.github.com/search/code?l=Jupyter+Notebook&q=ipynb+in:path+extension:ipynb'
page_url = '&per_page=100'


def search_range_on_github(search_range):
    # get search variables ready
    start = search_range[0]
    end = search_range[1]    
    size_url = get_size_url(start, end)
    search_url = base_url + size_url + page_url
    
    # query GitHub
    r = requests.get(search_url, headers = header)
    j = r.json()
    h = r.headers
    
    # if asked to slow down due to abuse rate limiting, wait the allotted time, then retry
    if r.status_code == 403:
        wait = h['Retry-After']
        print("%s: Hit rate limit. Retry after %s seconds" % (h['Date'], wait))
        time.sleep(int(wait) + 1)
        # return so the error response below is not processed as search results
        return search_range_on_github(search_range)
    
    # get data about the query time and number of results
    num_items = int(len(j['items']))
    num_results = int(j['total_count'])
    date = r.headers["Date"]
    
    # log the query
    log_string = "%s: %s-%s bytes %s results" % (date, start, end, num_results)
    write_to_log(log_string)
    
    # continue to search until less than 1,000 results    
    if num_results > 1000:
        new_search_range = set_search_range(search_range, num_results)
        schedule_search(new_search_range, h['X-RateLimit-Remaining'], h['X-RateLimit-Reset'])

    # when less than 1,000 results 
    else:
        # save the first page of data
        save_result(j, start, end, 1)
        log_string = "%s: %s-%s bytes p%s %s items" % (date, start, end, 1, num_items)
        write_to_log(log_string)
                  
        # traverse pagination if needed    
        if 'next' in r.links:
            next_url = r.links['next']['url']
            traverse_pagination(next_url, search_range, h['X-RateLimit-Remaining'], h['X-RateLimit-Reset'])
        
        # otherwise search the next range of notebook sizes       
        else:
            new_search_range = set_search_range(search_range, num_results)
            schedule_search(new_search_range, h['X-RateLimit-Remaining'], h['X-RateLimit-Reset']) # h['X-RateLimit-Limit']

def get_size_url(start, end):
    # generate the url string needed to search a range of file sizes
    size_url = "+size:"
    if end == start:
        size_url += str(start)
        return size_url
    elif end > start:
        size_url += str(start) + ".." + str(end)
        return size_url
    else:
        print("Error: Search range end smaller than start %s - %s" % (start, end))
            
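def schedule_search(search_range, remaing_queries, reset_time):
    # set_search_range returns None once the end of the search is reached
    if search_range is None:
        return
    # delay the search to avoid being rate limited
    # (header values arrive as strings, so convert before comparing)
    if int(remaing_queries) == 0:
        time.sleep(max(0, int(reset_time) - time.time()) + 1)
    else:
        time.sleep(3)
    search_range_on_github(search_range)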
    
def save_result(json_result, start, end, page):
    # save the result to a json file
    filename = "data/github_notebooks_%s_%s_p%s.json" % (start, end, page)    
    with open(filename, 'w') as json_file:
        json.dump(json_result, json_file)

def write_to_log(msg):
    f = 'nb_metadata_query_log.txt'
    with open(f, "a") as log_file:
        log_file.write(msg + "\n")

# even if a query returns 1000 results, these are broken up to only have 100 per page
# with a link between pages
def traverse_pagination(url, search_range, remaing_queries, reset_time):
    # wait to run query if needed (header values arrive as strings, so convert first)
    if int(remaing_queries) == 0:
        print("No queries left")
        time.sleep(max(0, int(reset_time) - time.time()) + 1)
    else:
        time.sleep(3)
    
    start = search_range[0]
    end = search_range[1]
    
    # get this page's data
    r = requests.get(url, headers = header)
    j = r.json()
    h = r.headers
    
    # if asked to slow down due to abuse rate limiting, wait the allotted time, then retry
    if r.status_code == 403:
        wait = h['Retry-After']
        print("%s: Hit rate limit. Retry after %s seconds" % (h['Date'], wait))
        time.sleep(int(wait) + 1)
        # return so the error response below is not processed as search results
        return traverse_pagination(url, search_range, remaing_queries, reset_time)
    
    # get data about the time and number of results returned by query
    num_results = int(j['total_count'])
    num_items = len(j['items'])
    page_num = url.split('&page=')[1]
    date = r.headers["Date"]
    
    # save the results and write to log file
    save_result(j, start, end, page_num)
    log_string = "%s: %s-%s bytes p%s %s items" % (date, start, end, page_num, num_items)
    write_to_log(log_string)
    
    # keep iterating through links to next page of search results if multiple pages
    if "next" in r.links:
        next_url = r.links['next']['url']
        traverse_pagination(next_url, search_range, h['X-RateLimit-Remaining'], h['X-RateLimit-Reset'])
        
    else:
        new_search_range = set_search_range(search_range, num_results)
        schedule_search(new_search_range, h['X-RateLimit-Remaining'], h['X-RateLimit-Reset'])
    
def set_search_range(prior_range, prior_num_results):
    # look at prior search range
    prior_start = prior_range[0]
    prior_end = prior_range[1]
    prior_delta = prior_end - prior_start
    
    # if too many results last time, reset the range to try and only get 1,000 results in the query
    if prior_num_results > 1000:
        # if the prior range was just one file size, we can't go any smaller, so just move on
        if prior_delta == 0:
            #log that there are more than 1,000 files of this size, and move on
            log_string = "TOO MANY RESULTS: %s-%s bytes, %s results" % (prior_start, prior_end, prior_num_results)
            write_to_log(log_string)
            
            new_delta = 10
            start = prior_end + 1 
            end = start + new_delta
        else:
            new_delta = int(prior_delta / 2)
            start = prior_start
            end = start + new_delta
    # if under 1000 results, either increase the search size, or keep it the same size
    else: 
        if prior_num_results < 500:
            new_delta = int(prior_delta * 2)
            start = prior_end + 1
            end = start + new_delta
        else:
            new_delta = prior_delta
            start = prior_end + 1
            end = start + new_delta

    # stop conditions
    if int(start) > 100000000:
        print("Start value too high. May have reached end of search")
        return
    if int(end) > 100000000:
        end = 100000000
            
    return [int(start), int(end)]

In [None]:
# this is the starting search range, we changed it over time when a bug caused the code
# to crash and we needed to restart the process
search_range = [4675,4695]            
            
# do first search
search_range_on_github(search_range)
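
To make the range-adjustment rule easier to follow, here is a condensed restatement of `set_search_range` traced on a few inputs. It is a sketch rather than the code that was run: the name `next_range` is hypothetical, the logging and the 100,000,000-byte stop conditions are left out, and the result counts passed in are made-up examples.

```python
def next_range(prior_range, prior_num_results):
    # condensed restatement of set_search_range (no logging, no stop conditions)
    start, end = prior_range
    delta = end - start
    if prior_num_results > 1000:
        if delta == 0:
            # a single file size still has too many results: skip past it
            return [end + 1, end + 1 + 10]
        # too many results: retry from the same start with half the width
        return [start, start + delta // 2]
    if prior_num_results < 500:
        # few results: move on and double the width
        return [end + 1, end + 1 + delta * 2]
    # otherwise move on and keep the same width
    return [end + 1, end + 1 + delta]

print(next_range([4675, 4695], 1500))  # [4675, 4685]
print(next_range([4675, 4695], 300))   # [4696, 4736]
print(next_range([4675, 4695], 700))   # [4696, 4716]
print(next_range([4675, 4675], 1500))  # [4676, 4686]
```

The effect is that the window narrows where notebooks are dense and widens where they are sparse, aiming for between 500 and 1,000 results per query.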

## Rewrite without recursion (6954 - 100,000,000 bytes)

Turns out, Python has a maximum depth for recursion as a safety check (somewhere around 998 calls). I hit that limit about every 8% of the data set, or 80,000 notebooks. To avoid hitting that limit, I have rewritten the code to use while loops rather than recursion.

I started using this search algorithm around the 6934-6954 byte range. You will notice several instances in the log of iterating over the 6955-6975 dataset due to a bug in the code, but after fixing that it looks like things ran smoothly.
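
The difference between the two approaches can be seen in a small, self-contained sketch. The two functions below are toy stand-ins (not the search code itself): one recurses once per step, the other loops, and both are asked to take more steps than the interpreter's recursion limit allows.

```python
import sys

def count_down_recursive(n):
    # every pending call adds one frame to the call stack
    if n == 0:
        return 0
    return 1 + count_down_recursive(n - 1)

def count_down_loop(n):
    # a loop reuses a single frame, so the number of steps is not limited
    steps = 0
    while n > 0:
        n -= 1
        steps += 1
    return steps

limit = sys.getrecursionlimit()

try:
    count_down_recursive(limit + 100)
    hit_limit = False
except RecursionError:
    hit_limit = True

print(limit, hit_limit, count_down_loop(limit + 100))
```

Raising the limit with `sys.setrecursionlimit` would have been another option, but a loop avoids the problem regardless of how many queries are needed.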

In [None]:
# http request authentication
header = {'Authorization': 'token %s' % os.environ['GITHUB_TOKEN']}

# initialize query request parameters
base_url = 'https://api.github.com/search/code?l=Jupyter+Notebook&q=ipynb+in:path+extension:ipynb'
page_url = '&per_page=100'


def run_query_loop(search_range):
    # set initial range
    start = search_range[0]
    end = search_range[1]
    
    # set initial loop management variables
    remaing_queries = 30 
    reset_time = time.time()
    limited = False
    wait_time = 0
    
    # so long as we have not reached the end (i.e. 100mb files)
    while start < 100000000:
        
        # limit how often we query to prevent rate-limiting by GitHub
        if limited:
            time.sleep(wait_time)
            limited = False
        elif remaing_queries == 0:
            time.sleep(reset_time - time.time() + 1)
        else:
            time.sleep(4)
        
        # compose search url
        size_url = get_size_url(start, end)
        search_url = base_url + size_url + page_url

        # query GitHub
        r = requests.get(search_url, headers = header)
        j = r.json()
        h = r.headers
        
        # handle abuse rate limiting
        if r.status_code == 403:
            print("%s: Hit rate limit. Retry after %s seconds" % (h['Date'], h['Retry-After']))
            limited = True
            wait_time = int(h['Retry-After'])
            continue

        date = r.headers["Date"]
        num_items = len(j['items'])
        num_results = int(j['total_count'])
        # header values are strings; convert so the `== 0` check above can succeed
        remaing_queries = int(h['X-RateLimit-Remaining'])
        reset_time = int(h['X-RateLimit-Reset'])
            
        log_string = "%s: %s-%s bytes %s results" % (date, start, end, num_results)
        write_to_log(log_string)

        # continue to search until less than 1,000 results    
        if num_results > 1000:
            new_search_range = set_search_range([start, end], num_results)
            start = new_search_range[0]
            end = new_search_range[1]
            continue

        # when less than 1,000 results 
        else:
            # save the first page of data
            save_result(j, start, end, 1)
            log_string = "%s: %s-%s bytes p%s %s items" % (date, start, end, 1, num_items)
            write_to_log(log_string)

            # traverse pagination if needed    
            if 'next' in r.links:
                next_url = r.links['next']['url']
                another_page = True
                
                while another_page:
                    if limited:
                        time.sleep(wait_time)
                        limited = False
                    elif remaing_queries == 0:
                        time.sleep(reset_time - time.time() + 1)
                    else:
                        time.sleep(4)

                    rp = requests.get(next_url, headers = header)
                    jp = rp.json()
                    hp = rp.headers

                    if rp.status_code == 403:
                        print("%s: Hit rate limit. Retry after %s seconds" % (hp['Date'], hp['Retry-After']))
                        limited = True
                        wait_time = int(hp['Retry-After'])
                        continue

                    date = rp.headers["Date"]
                    num_results = int(jp['total_count'])
                    num_items = len(jp['items'])
                    page_num = next_url.split('&page=')[1]
                    # header values are strings; convert before comparing with 0
                    remaing_queries = int(hp['X-RateLimit-Remaining'])
                    reset_time = int(hp['X-RateLimit-Reset'])
                    

                    save_result(jp, start, end, page_num)
                    log_string = "%s: %s-%s bytes p%s %s items" % (date, start, end, page_num, num_items)
                    write_to_log(log_string)

                    if "next" in rp.links:
                        next_url = rp.links['next']['url']
                    else:
                        another_page = False
           
            # otherwise search the next range of notebook sizes       
            new_search_range = set_search_range([start, end], num_results)
            start = new_search_range[0]
            end = new_search_range[1]
            continue
                
    print("Loop has finished with range of %s-%s" % (start, end))

def get_size_url(start, end):
    size_url = "+size:"
    if end == start:
        if start > 100000000:
            size_url += ">"
            size_url += str(start)
            return size_url
        else:
            size_url += str(start)
            return size_url
    elif end > start:
        size_url += str(start) + ".." + str(end)
        return size_url
    else:
        print("Error: Search range end smaller than start %s - %s" % (start, end))
    
def save_result(json_result, start, end, page):
    filename = "data/github_notebooks_%s_%s_p%s.json" % (start, end, page)    
    with open(filename, 'w') as json_file:
        json.dump(json_result, json_file)

def write_to_log(msg):
    f = 'nb_metadata_query_log.txt'
    with open(f, "a") as log_file:
        log_file.write(msg + "\n")
    
def set_search_range(prior_range, prior_num_results):
    prior_start = prior_range[0]
    prior_end = prior_range[1]
    prior_delta = prior_end - prior_start
    
    if prior_num_results > 1000:
        if prior_delta == 0:
            #log that there are more than 1,000 files of this size, and move on
            log_string = "TOO MANY RESULTS: %s-%s bytes, %s results" % (prior_start, prior_end, prior_num_results)
            write_to_log(log_string)
            
            new_delta = 10
            start = prior_end + 1 
            end = start + new_delta
        else:
            new_delta = int(prior_delta / 2)
            start = prior_start
            end = start + new_delta
    else: 
        if prior_num_results < 500:
            new_delta = int(prior_delta * 2)
            start = prior_end + 1
            end = start + new_delta
        else:
            new_delta = prior_delta
            start = prior_end + 1
            end = start + new_delta

    if int(start) > 100000000:
        print("Start value too high. May have reached end of search")
        end = start
            
    return [int(start), int(end)]

In [None]:
search_range = [72088186,100000000]            
            
# do first search
run_query_loop(search_range)


Just to be safe, we do one final query for notebooks larger than 100 MB so that none are missed. I believe GitHub limits file size to 100 MB or less.

In [None]:
search_range = [100000000,1000000000]            
            
# do first search
run_query_loop(search_range)
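
One thing worth checking before relying on the cell above: `run_query_loop` only iterates while `start < 100000000`, so a starting value of exactly 100000000 appears to skip the loop body without sending a query. A one-off request using GitHub's `size:>N` qualifier may be a more direct check. The sketch below only composes the URL; the helper name `over_size_url` is hypothetical and not part of the original code.

```python
base_url = 'https://api.github.com/search/code?l=Jupyter+Notebook&q=ipynb+in:path+extension:ipynb'
page_url = '&per_page=100'

def over_size_url(min_bytes):
    # hypothetical helper: match files strictly larger than min_bytes
    return base_url + '+size:>' + str(min_bytes) + page_url

print(over_size_url(100000000))
```

The resulting URL could then be fetched with `requests.get(url, headers=header)` in the same way as the other queries in this notebook.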

## Additional Queries for missing data

While cleaning the data, we found we were missing some of the notebook metadata. Some query results were incomplete (i.e. GitHub did not return all the notebooks they said they would), or there were no items in the query response (possibly an error in our JSON writing code). 

Just in case it was an issue with the GitHub API, we searched for these ranges again. A better initial search algorithm would check for these incomplete results at the time of the query and redo the search immediately. As it is, we may still miss some notebooks, or get duplicate notebooks if they have changed size in the time between the initial search and now.
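
A check of the kind described above might look something like the following. This is a minimal sketch, not what was run: `is_incomplete` is a hypothetical helper, and it assumes the pages for one size range have been collected into a list before deciding whether to retry.

```python
def is_incomplete(total_count, pages, max_results=1000):
    """Return True when the collected pages hold fewer items than expected.

    total_count: the 'total_count' value reported for the query
    pages: list of decoded JSON responses, one per page of results
    max_results: the most items a single query can return (1,000, as noted earlier)
    """
    collected = sum(len(page.get('items', [])) for page in pages)
    expected = min(total_count, max_results)
    return collected < expected

# made-up example: 250 results promised, but the third page came back empty
pages = [
    {'total_count': 250, 'items': [{}] * 100},
    {'total_count': 250, 'items': [{}] * 100},
    {'total_count': 250, 'items': []},
]
print(is_incomplete(250, pages))  # True
```

Calling this right after the last page of a range is fetched would let the loop repeat that range immediately instead of discovering the gap during cleaning.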

In [None]:
missing =  ['13924..13964',
    '3111498..3193418',
    '11956..11996',
    '19659380..22280820',
    '3423..3439',
    '3846..3862',
    '4590..4606',
    '3998..4014',
    '3183..3199',
    '12530..12570',
    '127088..127728',
    '4234..4250',
    '3812..3828',
    '30355..30435',
    '4336..4352',
    '2877..2893',
    '22780..22860',
    '1337..1347',
    '1626..1642',
    '15523..15563',
    '3863..3879',
    '3913..3929',
    '166617..167257',
    '2435..2451',
    '4455..4471',
    '197385..198025',
    '3964..3980',
    '28330..28410',
    '4151..4166', 
    '4184..4200',
    '4201..4216',
    '4506..4521',
    '4521..4538',
    '3745..3760',
    '3761..3777',
    '3880..3900',
    '3901..3912',
    '3661..3680',
    '3681..3693',
    '141831..142471',
    '17368..17408',
    '626..646',
    '3576..3592',
    '3778..3794',
    '4100..4116',
    '10931..10971',
    '2418..2434',
    '1059..1079',
    '4184..4216',
    '28978..29058',
    '3745..3777',
    '103996..104316',
    '4506..4538',
    '10396..10436',
    '2792..2808',
    '4066..4082',
    '14867..14907',
    '17204..17244',
    '1864..1880',
    '4083..4099',
    '4167..4183',
    '4117..4133',
    '4370..4386',
    '3728..3744',
    '3694..3710',
    '4217..4233',
    '4438..4454',
    '4015..4031',
    '4387..4403',
    '18879..18919',
    '17368..17408',
    '136703..137343',
    '169181..169821',
    '164694..165334',
    '3829..3845',
    '3440..3456',
    '2005553..2046513',
    '30922..31002',
    '1231..1251',
    '26791..26871',
    '23023..23103',
    '287066..288346',
    '274256..275536',
    '13637..13677',
    '12120..12160',
    '3880..3912',
    '4641..4657',
    '141831..142471',
    '11423..11463',
    '2619968..2660928',
    '4489..4505',
    '3406..3422',
    '4032..4048',
    '3661..3693',
    '3559..3575',
    '3644..3660',
    '13596..13636',
    '4134..4166']

In [None]:
# http request authentication
header = {'Authorization': 'token %s' % os.environ['GITHUB_TOKEN']}

# initialize query request parameters
base_url = 'https://api.github.com/search/code?l=Jupyter+Notebook&q=ipynb+in:path+extension:ipynb'
page_url = '&per_page=100'

# set initial loop management variables
remaing_queries = 30 
reset_time = time.time()
limited = False
wait_time = 0

def save_result(json_result, start, end, page):
    filename = "data/github_notebooks_%s_%s_p%s.json" % (start, end, page)    
    with open(filename, 'w') as json_file:
        json.dump(json_result, json_file)

def write_to_log(msg):
    f = 'nb_metadata_query_log.txt'
    with open(f, "a") as log_file:
        log_file.write(msg + "\n")

for m in missing:
    
    not_started = True
    
    while not_started:
    
        start = int(m.split('..')[0])
        end = int(m.split('..')[1])

        # limit how often we query to prevent rate-limiting by GitHub
        if limited:
            time.sleep(wait_time)
            limited = False
        elif remaing_queries == 0:
            time.sleep(reset_time - time.time() + 1)
        else:
            time.sleep(7)

        # compose search url
        size_url = '+size:' + m
        search_url = base_url + size_url + page_url

        # query GitHub
        r = requests.get(search_url, headers = header)
        j = r.json()
        h = r.headers

        # handle abuse rate limiting
        if r.status_code == 403:
            print("%s: Hit rate limit. Retry after %s seconds" % (h['Date'], h['Retry-After']))
            limited = True
            wait_time = int(h['Retry-After'])
            continue
        else:
            not_started = False
            print(m)

        date = r.headers["Date"]
        num_items = len(j['items'])
        num_results = int(j['total_count'])
        # header values are strings; convert so the `== 0` check above can succeed
        remaing_queries = int(h['X-RateLimit-Remaining'])
        reset_time = int(h['X-RateLimit-Reset'])

        log_string = "%s: %s-%s bytes %s results" % (date, start, end, num_results)
        write_to_log(log_string)

        # save the first page of data
        save_result(j, start, end, 1)
        log_string = "%s: %s-%s bytes p%s %s items" % (date, start, end, 1, num_items)
        write_to_log(log_string)

        # traverse pagination if needed    
        if 'next' in r.links:
            next_url = r.links['next']['url']
            another_page = True

            while another_page:
                if limited:
                    time.sleep(wait_time)
                    limited = False
                elif remaing_queries == 0:
                    time.sleep(reset_time - time.time() + 1)
                else:
                    time.sleep(7)

                rp = requests.get(next_url, headers = header)
                jp = rp.json()
                hp = rp.headers

                if rp.status_code == 403:
                    print("%s: Hit rate limit. Retry after %s seconds" % (hp['Date'], hp['Retry-After']))
                    limited = True
                    wait_time = int(hp['Retry-After'])
                    continue

                date = rp.headers["Date"]
                num_results = int(jp['total_count'])
                num_items = len(jp['items'])
                page_num = next_url.split('&page=')[1]
                # header values are strings; convert before comparing with 0
                remaing_queries = int(hp['X-RateLimit-Remaining'])
                reset_time = int(hp['X-RateLimit-Reset'])


                save_result(jp, start, end, page_num)
                log_string = "%s: %s-%s bytes p%s %s items" % (date, start, end, page_num, num_items)
                write_to_log(log_string)

                if "next" in rp.links:
                    next_url = rp.links['next']['url']
                else:
                    another_page = False

# End

And that's a wrap. We should now have metadata on 1.25 million Jupyter notebooks on GitHub, including their URL, repository, owner, and so on.

Now off to the [notebook metadata profiling](1_nb_metadata_cleaning.ipynb) for an initial look at the data and basic cleaning.