***Mining Source Code Repositories***    
*Goal*: harvest repositories containing specified keywords    
*Supported code collaboration and version control tools*: GitHub  

*Default Parameters in the configuration file*:  
  - repo_sources: github
  - repo_keywords: doi+10, doi+10+in:readme
  - start_date: "2008-01-01"
  - end_date: "2020-10-01"
  - delta: days=7
  - repo_harvester: search_repos, readme
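
For orientation, the defaults above might correspond to a parsed configuration like the one below. This is only a sketch of the dictionary that `yaml.safe_load` could return. The key layout (`repo_harvester`, `supported_sources`, `authentication`) is inferred from how the notebook cells read `params`. The dictionary form of `delta`, the `token_required` value, and the token placeholder are assumptions, not values copied from the actual file.

```python
# Hypothetical parsed form of config.yaml; the layout is inferred from the
# notebook code, and the concrete values are placeholders.
params = {
    "repo_sources": ["github"],
    "repo_keywords": ["doi+10", "doi+10+in:readme"],
    "start_date": "2008-01-01",
    "end_date": "2020-10-01",
    "delta": {"days": 7},  # assumed representation of "days=7"
    "repo_harvester": ["search_repos", "readme"],
    "supported_sources": {
        "github": {"class": "GitHubHarvester", "token_required": "true"},
    },
    "authentication": {"github": "<your-token>"},
}

# Every requested source should have a matching entry in supported_sources.
unsupported = [s for s in params["repo_sources"]
               if s not in params["supported_sources"]]
print("unsupported sources:", unsupported)
```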

In [None]:
import sys
import time
from datetime import datetime
import yaml
from modules.database import RepoCollection
from modules.github_harvester import GitHubHarvester
from IPython.display import clear_output, display

**Load Required Parameters**  
All parameters necessary for the repository harvesting process are specified in the associated configuration file, located in the same folder. The specified repository hosting services are checked against the supported services, and a notification is printed for each unsupported service that is skipped. Services that require an authentication token but have none specified are skipped as well. The given authentication tokens are stored in the corresponding dictionary entry.   
A MongoDB database is used to store the metadata and additional information of the harvested repositories. If the given database table does not exist, the user has to confirm whether a new database table with this name should be created, or specify an alternative database table.
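
The source check described above can be sketched as a small standalone function. This is a minimal illustration, not the notebook's own code: `select_sources` is a hypothetical helper, and the `gitlab` and `bitbucket` entries exist only to exercise the two exclusion paths. It also normalises `token_required`, since YAML parses an unquoted `true` as a boolean and a quoted one as a string.

```python
def select_sources(params):
    """Return the requested sources that are supported and usable."""
    selected = []
    for name in params["repo_sources"]:
        supported = params["supported_sources"].get(name)
        if supported is None:
            print("excluded, as not supported:", name)
            continue
        # Accept both a YAML boolean (True) and a quoted string ('true').
        needs_token = str(supported.get("token_required", False)).lower() == "true"
        token = params.get("authentication", {}).get(name)
        if needs_token and not token:
            print("excluded, as token is needed:", name)
            continue
        selected.append(name)
    return selected


# Illustrative input only; the notebook itself supports GitHub.
example = {
    "repo_sources": ["github", "gitlab", "bitbucket"],
    "supported_sources": {
        "github": {"token_required": True},
        "gitlab": {"token_required": "true"},
    },
    "authentication": {"github": "<token>", "gitlab": ""},
}
print(select_sources(example))
```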

In [None]:
sources = []

# load parameters from the config file
with open("config.yaml", 'r') as stream:
    params = yaml.safe_load(stream)

# check whether the specified sources are supported and, 
# if required, an authentication token is given 
for param in params['repo_sources']:
    if param not in params['supported_sources']:
        print("excluded, as not supported: ", param)
    # accept both a YAML boolean (true) and a quoted string ('true')
    elif (str(params['supported_sources'][param]['token_required']).lower() == 'true'
          and not params['authentication'][param]):
        print("excluded, as token is needed: ", param)
    else:
        sources.append(param)

# check if database table exists
repo_collection = RepoCollection()

**Harvesting Repositories**  
The configuration file contains two flags indicating whether the repository metadata and the readme files should be harvested. First, the flag for metadata harvesting is checked. For each repository hosting service (source), the associated harvester class is instantiated. The search process iterates over all specified keywords. Because the number of matching repositories can exceed the number of results the API returns (limited to 1,000), the search period is split into search intervals, whose length can be defined in the configuration file.   
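
One way to split the search period is sketched below. `create_intervals` is a hypothetical stand-in, since the harvester's own `create_interval` method is not shown in this notebook. The sketch assumes that `delta` is given as keyword arguments for `timedelta` and that each window is formatted as a `start..end` date range, the form GitHub's search qualifiers such as `created:` accept.

```python
from datetime import date, timedelta


def create_intervals(start_date, end_date, delta):
    """Split [start_date, end_date] into consecutive, non-overlapping windows."""
    start = date.fromisoformat(start_date)
    end = date.fromisoformat(end_date)
    step = timedelta(**delta)
    if step < timedelta(days=1):
        raise ValueError("delta must span at least one day")
    intervals = []
    while start <= end:
        # Windows are inclusive on both ends, so the next one starts a day later.
        stop = min(start + step - timedelta(days=1), end)
        intervals.append(f"{start.isoformat()}..{stop.isoformat()}")
        start = stop + timedelta(days=1)
    return intervals


windows = create_intervals("2020-09-01", "2020-09-20", {"days": 7})
print(windows)
```

If a single window still matches more than 1,000 repositories, a shorter `delta` is needed for that part of the search period.
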
After the repositories have been requested, their metadata are stored in the database table. Because the search terms overlap, a repository may be returned more than once. If a repository has already been inserted, the current keyword is added to the repository's keyword list. If no entry for the repository exists, its metadata are inserted together with additional information, such as the harvesting date, the hosting service, and the associated search term.   
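
The insert-or-extend behaviour can be illustrated with an in-memory dictionary in place of the database. The function below is a sketch under that assumption. It is not the implementation of `RepoCollection.save_repo`, and the field names `keywords` and `harvested` are made up for the example.

```python
from datetime import datetime


def save_repo(store, repo, keyword, source, harvested_at):
    """Insert a repository or, if it is already stored, record the extra keyword."""
    key = (source, repo["id"])
    entry = store.get(key)
    if entry is None:
        # First sighting: keep the metadata plus the harvesting context.
        store[key] = {**repo, "source": source,
                      "harvested": harvested_at, "keywords": [keyword]}
    elif keyword not in entry["keywords"]:
        entry["keywords"].append(keyword)


store = {}
now = datetime(2020, 10, 1)
save_repo(store, {"id": 1, "full_name": "octo/demo"}, "doi+10", "github", now)
save_repo(store, {"id": 1, "full_name": "octo/demo"}, "doi+10+in:readme", "github", now)
print(store[("github", 1)]["keywords"])
```

With MongoDB, a comparable effect is commonly achieved with an upsert that combines `$setOnInsert` and `$addToSet`. Whether the notebook's database module does so is not shown here.
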
The GitHub REST API limits the number of search results to 1,000, grouped into at most ten pages of at most 100 repositories each. The HTTP `Link` response header contains the URL of the next page. This link is extracted and requested.  
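
Extracting the next page from a `Link` header can be sketched as follows. `next_page_url` is a hypothetical helper, not the harvester's `get_next_page` method, and the example header values are illustrative.

```python
def next_page_url(link_header):
    """Return the URL tagged rel="next" in a Link header, or None."""
    for part in link_header.split(","):
        url_part, _, attributes = part.partition(";")
        if 'rel="next"' in attributes.replace(" ", ""):
            # The URL is wrapped in angle brackets: <https://...>
            return url_part.strip().strip("<>")
    return None


middle_page = ('<https://api.github.com/search/repositories?q=doi&page=2>; rel="next", '
               '<https://api.github.com/search/repositories?q=doi&page=10>; rel="last"')
last_page = ('<https://api.github.com/search/repositories?q=doi&page=9>; rel="prev", '
             '<https://api.github.com/search/repositories?q=doi&page=1>; rel="first"')

print(next_page_url(middle_page))
print(next_page_url(last_page))
```

If the responses come from the `requests` library (an assumption here), `response.links` already provides the parsed header, so `response.links.get('next', {}).get('url')` yields the same URL.
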
After each API call, the iteration is paused to stay within the API rate limit.
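
How long to pause is up to the harvester class and is not shown in this notebook. One possible policy, sketched here with a hypothetical helper, is a short fixed pause after every call and a wait until the rate-limit window resets once the quota is used up. The reset time would come from GitHub's `X-RateLimit-Reset` header (seconds since the epoch). The two-second default is an assumption, chosen to match a limit of 30 search requests per minute.

```python
def seconds_to_wait(remaining, reset_epoch, now_epoch, min_pause=2.0):
    """Pause briefly after each call; wait for the reset when no requests remain."""
    if remaining == 0:
        # Quota exhausted: wait until the window resets, plus one second of slack.
        return max(reset_epoch - now_epoch, 0) + 1
    # Quota available (or unknown, e.g. -1 before the first response).
    return min_pause


print(seconds_to_wait(remaining=12, reset_epoch=1_600_000_060, now_epoch=1_600_000_000))
print(seconds_to_wait(remaining=0, reset_epoch=1_600_000_060, now_epoch=1_600_000_000))
```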

In [None]:
if 'search_repos' in params['repo_harvester']:
    for source in sources:

        remaining_search_requests = -1
        # instantiate harvester class
        current = getattr(
            sys.modules[__name__],
            params['supported_sources'][source]['class'])(params['authentication'][source])
        next_url = None

        for key in params['repo_keywords']:
            # build the search intervals from the start date, end date,
            # and delta given in the config file
            for interval in current.create_interval(params['start_date'],
                                                    params['end_date'],
                                                    params['delta']):

                while True:
                    # progress indicator
                    clear_output(wait=True)
                    if next_url:
                        print('API call: ', next_url)
                    else:
                        print('API call for keyword', key, 'and interval', interval)

                    # request repositories
                    response = current.get_search_results(
                        remaining_search_requests,
                        next_url,
                        key,
                        interval)

                    # store metadata of each repository in db
                    if response:
                        for elem in response.json()['items']:
                            repo_collection.save_repo(elem, key, source, datetime.now())

                        # check whether further pages are available, and if so set next request url
                        # link may contain first, last, prev, and next URL
                        if 'link' in response.headers:
                            next_url = current.get_next_page(response.headers['link'].split(","))
                        else:
                            next_url = None

                        # set the remaining rate limit
                        remaining_search_requests = int(response.headers['X-Ratelimit-Remaining'])
                    time.sleep(current.get_search_sleep_time())

                    if not next_url:
                        break

**Harvest Readme Files**  
In addition to the repository metadata, the Readme file may be harvested. It is intended to provide the context of the specified search terms so that they can be extracted in later processing steps. For each repository without an existing readme field, the Readme file is requested. Because the stored repositories are not sorted by hosting service, the associated harvester class is instantiated based on each repository's source information. If a repository does not contain a Readme file, the note 'empty readme' is added to the repository's entry.     
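
The flow described above can be sketched with in-memory records. Everything in the example is hypothetical: `DemoHarvester` stands in for a real harvester class, a plain list replaces the database table, and an explicit registry dictionary is used instead of looking the class up by name in the current module.

```python
class DemoHarvester:
    """Hypothetical stand-in for a harvester class such as GitHubHarvester."""

    def __init__(self, token):
        self.token = token

    def get_readme(self, repo_id):
        # Pretend that only repository 1 has a Readme file.
        return "# Demo readme" if repo_id == 1 else None


# Explicit registry: source name -> harvester class and token.
HARVESTERS = {"demo": DemoHarvester}
TOKENS = {"demo": None}

repos = [
    {"id": 1, "source": "demo"},
    {"id": 2, "source": "demo"},
    {"id": 3, "source": "demo", "readme": "already harvested"},
]

for repo in repos:
    if "readme" in repo:
        continue  # only entries without a readme field are processed
    harvester = HARVESTERS[repo["source"]](TOKENS[repo["source"]])
    text = harvester.get_readme(repo["id"])
    repo["readme"] = text if text else "empty readme"

print([r["readme"] for r in repos])
```

Because every processed entry receives a readme field, including the 'empty readme' note, no repository is requested twice when the cell is run again.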

In [None]:
if 'readme' in params['repo_harvester']:

    total = repo_collection.get_number_of_entries({'readme':{"$exists" : False}})
    counter = 0
    remaining_core_requests = -1
    print('Started harvesting Readme files...')

    while True:

        # progress indicator
        counter = counter + 1
        if counter % 100 == 0 or counter == total:
            clear_output(wait=True)
            print("Processed {0} of {1}".format(counter, total))

        # look up one repository without a readme field
        repo = repo_collection.get_entry({'readme':{"$exists" : False}})
        if repo:
            current = getattr(
                sys.modules[__name__],
                params['supported_sources'][repo['source']]['class'])(params['authentication'][repo['source']])
        else:
            break

        # request Readme file of the repository
        readme, remaining_core_requests = current.get_api_response('readme',
                                                                   str(repo['id']),
                                                                   remaining_core_requests)

        # add readme content to repository entry   
        post = {"$set" : {'readme': readme.text}} if readme else {"$set" : {'readme': 'empty readme'}}
        repo_collection.mod_entry({'id': repo['id']}, post)

        time.sleep(current.get_core_sleep_time())