***Research Software Identifier***
*Prerequisites*:
  - MongoDB
  - Jupyter Notebook
  - Packages (see requirements.txt)
  - Configuration File (in the same folder)

*Getting started*:
  - Create a virtual environment
  - Install required packages (pip install -r requirements.txt)
  - Specify the parameters in the configuration file
  - Run this Notebook
  
*Try on Binder*:
  - link

*Required Data* (no new data are harvested):
  - DB table with repositories: full_name, description, readme
  - DB table with publications: doi | arxiv_id | title, text fragments

In [None]:
import re
import sys
import time
import modules.database as db
from datetime import date, timedelta, datetime
import yaml
import requests
from IPython.display import clear_output, display
from dateutil.relativedelta import relativedelta
import modules.auxiliary_functions as aux
from modules.github_harvester import GitHubHarvester

**Load Required Parameters**  
All parameters necessary for identifying research software
are specified in the configuration file, located in the same folder.
The specified repository hosting services are checked against the
supported services, and a notification is printed for each unsupported service that is skipped.
Services that require an authentication token are also skipped if no
token is specified. The given authentication tokens are stored
in the corresponding dictionary entry.  
The MongoDB database stores the identified research software repositories
and their corresponding publications. Repositories and publications each get their
own database table. To link publications and repositories, each repository has a
list of DOI names and each publication has a list of repository names. If a given
database table does not exist, the user has to confirm whether a new database table
with this name should be created or specify an alternative database table.
Only the database table for the journal subject categorization has to be present.
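
The source check described above can be sketched as follows. This is a minimal illustration, not the notebook's implementation: the configuration values are hypothetical (including the `GitLabHarvester` class name) and only mirror the keys that this notebook reads, and `usable_sources` is a helper name introduced here.

```python
# Hypothetical configuration, mirroring only the keys this notebook reads.
params = {
    'repo_sources': ['github', 'gitlab', 'sourceforge'],
    'supported_sources': {
        'github': {'class': 'GitHubHarvester', 'token_required': True},
        'gitlab': {'class': 'GitLabHarvester', 'token_required': False},
    },
    'authentication': {'github': None},
}


def usable_sources(params):
    """Return the configured sources that are supported and, if required, have a token."""
    usable = []
    for source in params['repo_sources']:
        supported = params['supported_sources'].get(source)
        if supported is None:
            print("excluded, as not supported: ", source)
            continue
        token = params.get('authentication', {}).get(source)
        if supported['token_required'] and not token:
            print("excluded, as token is needed: ", source)
            continue
        usable.append(source)
    return usable


sources = usable_sources(params)
```

Using `.get` for the token lookup avoids a `KeyError` when a service has no entry under `authentication` at all.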

In [None]:
# load parameters from configuration file
with open("config.yaml", 'r') as stream:
    params = yaml.safe_load(stream)

for param in params['repo_sources']:
    if param not in params['supported_sources']:
        print("excluded, as not supported: ", param)
    elif (params['supported_sources'][param]['token_required'] and not params['authentication'][param]):
        print("excluded, as token is needed: ", param)

# instantiate MongoDB database collections and check if collections exist
repo_table = db.RepoCollection()
publication_table = db.Collection('publications')
rs_repo_table = db.RsRepoCollection()
rs_publication_table = db.RsArtifactCollection()

**Add Repositories and Publications**   
If no data are available or the existing data should be extended,
uncomment the parameters new_repositories and new_publications
in the configuration file; the corresponding Notebook is then executed.

In [None]:
# check whether new repositories and/or publications are required
# and run the corresponding harvester
if 'new_repositories' in params['rsidentifier']:
    %run repository_harvester.ipynb
if 'new_publications' in params['rsidentifier']:
    %run publication_harvester.ipynb

**Look up DOIs in the Repositories**  
One main criterion for research software is a referenced DOI.
Therefore, the gathered research software candidates are searched
for a DOI or shortDOI by iterating over the Repositories database table.
The extraction is done by the auxiliary function extract_doi, which returns
a list of DOIs. If the list is empty, the repository is not assumed to be a
research software repository and is not inserted into the rsRepositories
database table. All other repositories receive an entry in this database table.
If a repository already has an entry in the database table, its reference list
is updated. Each DOI found is inserted into the rsPublications database table with
the referencing repository in its list of repositories.
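
The implementation of extract_doi is not shown in this Notebook. As a rough sketch of the idea, the following stand-in (`extract_doi_sketch`, a name introduced here) uses a simplified regular expression for regular DOIs only; shortDOIs are left out, and the sample text and DOIs are made up.

```python
import re

# Simplified DOI pattern (an assumption, not the pattern used by aux.extract_doi):
# prefix "10.", a 4-9 digit registrant code, a slash, and a non-empty suffix.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+', re.IGNORECASE)


def extract_doi_sketch(text):
    """Return the distinct DOI-like strings found in text, in order of appearance."""
    found = []
    for match in DOI_PATTERN.findall(text):
        if match not in found:
            found.append(match)
    return found


readme = ("Please cite https://doi.org/10.1000/xyz123 when using this code. "
          "See also doi:10.1000/xyz123 and 10.5281/zenodo.12345.")
dois = extract_doi_sketch(readme)

# same reference format as used for the rsRepositories entries
references = [{'id': doi, 'mode': 'doi'} for doi in dois]
```

With this simple pattern, the last match keeps the sentence's trailing full stop. Such inexact matches are why extracted DOIs need to be validated before they are trusted.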

In [None]:
if 'dois' in params['rsidentifier']:

    total = repo_table.get_number_of_entries({})
    counter = 0
    print('Started extracting DOIs ...')

    for repo in repo_table.get_entries({}):

        # progress indicator
        counter = counter + 1
        if counter % 500 == 0 or counter == total:
            clear_output(wait=True)
            print("processed {0} of {1}".format(counter, total))

        dois = []
        elems = []

        # look up DOI references in the repository description if exists
        if repo['description']:
            elems = elems + aux.extract_doi(repo['description'])
        # look up DOI references in the repository Readme file
        if repo['readme']:
            elems = elems + aux.extract_doi(repo['readme'])

        if not elems:
            continue
        # remove duplicates and add id type
        for elem in list(set(elems)):
            dois.append({'id': elem, 'mode': 'doi'})

        # add repository to rsRepositories database table
        rs_repo_table.save_repo(repo['id'],
                                repo['full_name'],
                                dois,
                                repo['source'],
                                repo['source'],
                                repo['language'])

        # add DOIs to rsPublications database table
        rs_publication_table.save_publication(dois,
                                              repo['full_name'])

**Add Repository Names from Publications**  
The repository names are extracted from the provided text fragments
(Publications database table) and inserted into the rsRepositories and
rsPublications database tables. If an entry for a repository already exists,
the publication id is added to its reference list. Likewise, if the
publication entry already exists, the repository name is added
to its repository list.
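
To show what the extraction yields, here is a small worked example that applies the regular expression from this Notebook to a made-up text fragment. The helper name `find_repo_names` and the sample repositories are illustrative assumptions.

```python
import re

# regular expression for a full repository name in a GitHub URL (as in this Notebook)
PATTERN_REPO = r'(?i)github\.com/ ?([a-z0-9][a-z0-9-]*/[a-z0-9_\.-]+)'


def find_repo_names(text):
    """Return the distinct owner/name strings found in text, in order of appearance."""
    names = []
    for name in re.findall(PATTERN_REPO, text):
        if name not in names:
            names.append(name)
    return names


fragment = ("Code is available at https://github.com/example-lab/solver. "
            "Mirror: github.com/example-lab/solver and "
            "https://github.com/Other/tool_kit.git")
repos = find_repo_names(fragment)
```

The first match keeps the sentence's full stop and the last one keeps the `.git` ending, so the same repository can appear under several spellings until the names are confirmed against the hosting service.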

In [None]:
if 'repo_name' in params['rsidentifier']:

    # regular expression for a full repository name in a GitHub URL
    PATTERN_REPO = r'(?i)github\.com/ ?([a-z0-9][a-z0-9-]*/[a-z0-9_\.-]+)'

    counter = 0
    total = publication_table.get_number_of_entries({})
    fragments = ['summary', 'full_text_extract', 'summary_detail', 'arxiv_comment']    
    print('Started extracting repository names ...')

    for pub in publication_table.get_entries({}):

        # progress indicator
        counter = counter + 1
        if counter % 500 == 0 or counter == total:
            clear_output(wait=True)
            print("Processed {0} of {1}".format(counter, total))
        repos = []

        # find GitHub repository names in the available text fragments
        for fragment in [frag for frag in fragments if frag in pub]:
            repos.extend(name for name in re.findall(PATTERN_REPO,
                                                     pub[fragment])
                         if name not in repos)

        # add repositories to the research software database tables
        for repo in repos:
            # add information to rsRepositories
            ident = aux.create_reference_entry(pub, True)
            rs_repo_table.save_repo(None,
                                    repo,
                                    [ident],
                                    'github',
                                    pub['source'])

            # add information to rsPublications
            rs_publication_table.save_publication([ident], repo)

**Add Repositories of linked Owners**   
Besides repositories, in the publications are also repository 
owners referenced (https://github.com/{owner}). For these owners
all their repositories are requested and added to the database table.
This is done in two separate steps to prevent MongoDB cursor timeouts, 
request owner names twice, and to have access points for the start after 
intended and unintended breaks. Initially, the owner names are extracted 
and stored in a dictionary, together with the information of the associated publication
and a flag whether this user is already added to the database table. 
To avoid losing the dictionary data, the following two cells ought to be 
executed one after the other.
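
The two-step procedure can be sketched offline as follows. The publications are made up, the list of GitHub site names is shortened, and `harvest_pending` with its `fetch` callback is a hypothetical stand-in for the API requests. Unlike the cells below, the sketch uses `setdefault`, so an owner that is found again does not lose its flag.

```python
import re

# regular expression for a repository owner in a GitHub URL (as in this Notebook)
PATTERN_USER = r'(?i)github\.com/ ?([a-z0-9][a-z0-9-]*)/?(?:\s|\.|\)|\'|\"|$|\]|\;|\}|\,)'
GITHUB_SITES = {'explore', 'topics', 'trending'}  # shortened list for the sketch

publications = [
    {'doi': '10.1000/xyz123',
     'summary': 'See https://github.com/example-lab. Browse github.com/explore, too.'},
    {'doi': '10.1000/abc456',
     'summary': 'Code: github.com/example-lab/solver'},
]

# Step 1: collect owner names together with the publication and a progress flag.
owners = {}
for pub in publications:
    for name in re.findall(PATTERN_USER, pub['summary']):
        owners.setdefault(name, {'publication': pub, 'harvested': False})


# Step 2: process only owners that are not flagged yet, so a rerun resumes the work.
def harvest_pending(owners, fetch):
    for user, info in owners.items():
        if info['harvested'] or user in GITHUB_SITES:
            continue
        fetch(user)
        info['harvested'] = True


requested = []
harvest_pending(owners, requested.append)
harvest_pending(owners, requested.append)  # second run finds nothing left to do
```

The full repository URL in the second publication does not count as an owner link, because the pattern requires the owner name to be followed by whitespace, punctuation, or the end of the text.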

In [None]:
if 'user_name' in params['rsidentifier']:

    # regular expression for a repository owner in a GitHub URL
    PATTERN_USER = r'(?i)github\.com/ ?([a-z0-9][a-z0-9-]*)/?(?:\s|\.|\)|\'|\"|$|\]|\;|\}|\,)'

    total = publication_table.get_number_of_entries({})
    fragments = ['summary', 'full_text_extract', 'summary_detail', 'arxiv_comment']
    counter = 0
    remaining_requests = -1
    next_url = None
    owners = {}
    print('Started extracting user names...')

    for pub in publication_table.get_entries({}):

        # progress indicator
        counter = counter + 1
        if counter % 500 == 0 or counter == total:
            clear_output(wait=True)
            print("processed {0} of {1}".format(counter, total))
        
        users = []

        # find GitHub repository owner names in the available text fragments
        for fragment in [frag for frag in fragments if frag in pub]:
            users.extend(name for name in re.findall(PATTERN_USER,
                                                     pub[fragment])
                         if name not in users)

        for user in users:
            owners.update({user : {'publication': pub, 'harvested': False}})

In [None]:
if 'user_name' in params['rsidentifier']:

    counter = 0
    total = len(owners)
    # GitHub site names whose links look like owner links but are not owners
    github_sites = ['explore', 'topics', 'trending', 'collections', 'events',
                    'features', 'join', 'login', 'search', 'about', 'showcases', 
                    'marketplace']
    print('Started harvesting repositories of the given user names...')

    for user, infos in owners.items():
        if not infos['harvested']:

            # progress indicator
            counter = counter + 1
            if counter % 25 == 0 or counter == total:
                clear_output(wait=True)
                print("processed {0} of {1}".format(counter, total))
        
            # check if user candidate is one of the GitHub site names
            if user in github_sites:
                continue
            # instantiate the harvester class of the repository hosting service
            # to get all repositories of a user
            current = getattr(
                sys.modules[__name__],
                params['supported_sources']['github']['class'])(params['authentication']['github'])
            
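            # start each owner on the first result page; a failed request for the
            # previous owner could otherwise leave a stale next_url behind
            next_url = None
            while True: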
                # request repositories
                response, remaining_requests = current.get_api_response(
                    'user',
                    user,
                    remaining_requests,
                    next_url)
                # no valid user
                if not response:
                    break
                for repo in response.json():
                    repo_table.save_repo(repo, 'github.com', 'github', datetime.now())
                    ident = aux.create_reference_entry(infos['publication'], True)
                    rs_repo_table.save_repo(repo['id'],
                                            repo['full_name'],
                                            [ident],
                                            'github',
                                            infos['publication']['source'],
                                            repo['language'])
                    # check if current publication is in rsPublications database table
                    rs_publication_table.save_publication([ident],
                                                          repo['full_name'])
                
                # check whether further pages are available, and if so set next request url
                if 'link' in response.headers:
                    next_url = current.get_next_page(response.headers['link'].split(","))
                else:
                    next_url = None
                time.sleep(current.get_core_sleep_time())
            
                if not next_url:
                    break
            owners[user]['harvested'] = True

**Request Metadata**  
When the repositories of an owner are requested, the repository metadata are provided within the API response. This does not apply to repository names extracted from the publication text fragments: here, only the repository name with its associated publication is added to the research software repositories database table. To confirm these repository names and simultaneously harvest their metadata, the metadata of each repository are requested via the API of its hosting service.   
The regular expression does not always extract the exact name; for instance, the name may end with a full stop or a closing bracket. So, if the API returns a 404, the suffix of the name is checked and non-alphanumeric characters are removed, as well as some specific words, like .git, .The, or meta. Then the shortened repository name is requested.  
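
The auxiliary function check_name_suffix is not shown in this Notebook. The following sketch only illustrates the retry idea under stated assumptions: `strip_name_suffix` and `candidate_names` are names introduced here, and the stripping rules are guessed from the description above.

```python
import re

SUFFIX_WORDS = ('.git', '.The', 'meta')  # the words named in the text above


def strip_name_suffix(name):
    """Return name without one trailing artefact, or None if nothing can be removed."""
    for word in SUFFIX_WORDS:
        if name.endswith(word) and len(name) > len(word):
            return name[:-len(word)]
    stripped = re.sub(r'[^A-Za-z0-9]+$', '', name)
    return stripped if stripped and stripped != name else None


def candidate_names(name):
    """List the names that would be requested one after the other after each 404."""
    candidates = []
    while name:
        candidates.append(name)
        name = strip_name_suffix(name)
    return candidates


attempts = candidate_names('example-lab/solver.git).')
```

Returning `None` once nothing is left to strip ends a `while name:` loop like the one in the cell below, so a name that never resolves is eventually rejected.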

In [None]:
if 'metadata' in params['rsidentifier']:

    remaining_requests = -1
    total = rs_repo_table.get_number_of_entries({})
    counter = 0
    print('Started requesting repository metadata ...')

    for repo in rs_repo_table.get_entries({}):
        reject = True

        # progress indicator
        counter = counter + 1
        if counter % 100 == 0 or counter == total:
            clear_output(wait=True)
            print("processed {0} of {1}".format(counter, total))

        # if repo metadata already requested, continue with next repo
        repo_meta = repo_table.get_entry({'full_name': repo['full_name']})
        if repo_meta:
            # check whether returned id already is in rs_repo_table
            # if so, merge reference lists and remove duplicate entry
            if not rs_repo_table.merge_duplicates(repo_meta, repo) and not repo['id']:
                rs_repo_table.mod_entry({'_id':repo['_id']}, {'$set':{'id': repo_meta['id']}})
            continue

        # instantiate harvester class
        current = getattr(
            sys.modules[__name__],
            params['supported_sources'][repo['source']]['class'])(params['authentication'][repo['source']])

        name = repo['full_name']
        while name:
            response, remaining_requests = current.get_api_response(
                'metadata', 
                name, 
                remaining_requests)
            time.sleep(current.get_core_sleep_time())

            # repository has metadata
            if response and response.json():
                repo_table.save_repo(response.json(), 'github.com',
                                     repo['source'], datetime.now())
                # check whether returned id exists already in db table
                if not rs_repo_table.merge_duplicates(response.json(), repo):
                    rs_repo_table.mod_entry(
                        {'_id': repo['_id']},
                        {'$set':
                         {'id': response.json()['id'],
                          'full_name': response.json()['full_name'],
                          'language': response.json()['language']}})
                reject = False
                break
            # repository does not exist, check whether the last char
            # is not alphanumeric
            name = aux.check_name_suffix(name)
            meta_repo = repo_table.get_entry({'full_name':name})
            if meta_repo:
                # check whether returned id already is in rs_repo_table
                # if so, merge reference lists and remove duplicate entry
                if not rs_repo_table.merge_duplicates(meta_repo, repo) and not repo['id']:
                    rs_repo_table.mod_entry({'_id':repo['_id']}, {'$set':{'id': meta_repo['id']}})
                reject = False
                break
        if reject:
            rs_repo_table.remove_entry({'_id':repo['_id']})

**Check Content of Repositories**  
Research software contains source code. Therefore, repositories without an assigned language that consist only of Readme, License, and .gitignore files are excluded from further processing.
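
The harvester method that performs this check is not shown here. As a hedged sketch of the rule, the following code classifies a list of file names; the helper names and the assumption that only top-level file names are inspected are illustrative.

```python
# File names (without extension) that do not count as source code in this sketch.
NON_CODE_NAMES = {'readme', 'license', 'licence', '.gitignore'}


def has_only_boilerplate(file_names):
    """True if every file is a Readme, License, or .gitignore file (or there are no files)."""
    for file_name in file_names:
        lowered = file_name.lower()
        # keep dotfiles whole, otherwise compare the part before the first dot
        stem = lowered if lowered.startswith('.') else lowered.split('.')[0]
        if stem not in NON_CODE_NAMES:
            return False
    return True


def should_reject(language, file_names):
    """Reject only repositories without a language that hold nothing but boilerplate."""
    return language is None and has_only_boilerplate(file_names)


empty_shell = should_reject(None, ['README.md', 'LICENSE', '.gitignore'])
with_code = should_reject(None, ['README.md', 'main.py'])
```

A repository with an assigned language is never rejected by this rule, which matches the database query below that selects only entries whose language is None.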

In [None]:
if 'content' in params['rsidentifier']:

    remaining_requests = -1
    total = rs_repo_table.get_number_of_entries(
        {'$and': [
            {'checked_content':{'$exists':False}},
            {'language':None}]})
    counter = 0
    print('Started checking repository content...')

    for repo in rs_repo_table.get_entries(
        {'$and': [
            {'checked_content':{'$exists':False}},
            {'language':None}]}):

        # progress indicator
        counter = counter + 1
        if counter % 50 == 0 or counter == total:
            clear_output(wait=True)
            print("processed {0} of {1}".format(counter, total))

        # instantiate harvester class
        current = getattr(
            sys.modules[__name__],
            params['supported_sources'][repo['source']]['class'])(params['authentication'][repo['source']])
        # keep the repository unless the content check says otherwise
        reject = False

        name = repo['full_name']
        while name:
            reject, remaining_requests = current.has_no_possible_source_code_files(
                name,
                remaining_requests)
            time.sleep(current.get_core_sleep_time())
            if not reject:
                rs_repo_table.mod_entry(
                    {'_id': repo['_id']},
                    {'$set': {'full_name': name, 'checked_content':True}})
                break

            # repository does not exist, check whether the last char
            # is not alphanumeric
            name = aux.check_name_suffix(name)
            if rs_repo_table.get_entry({'full_name':name}):
                break
        if reject:
            rs_repo_table.remove_entry({'_id':repo['_id']}) 

**Request the Commit Dates**   
For the computation of the sustainability indicators, the lifespan and the activity status, the first and the last commit of a repository are required. These are not part of the metadata and therefore need to be requested separately. 
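
A small worked example of the two indicators, using made-up commit dates. The function name `sustainability_indicators` is introduced here, and the one-year window is approximated with `timedelta(days=365)` to keep the sketch free of third-party imports, whereas the cell below uses `relativedelta(years=1)`.

```python
from datetime import datetime, timedelta

TIMESTAMP_FORMAT = '%Y-%m-%dT%H:%M:%SZ'  # commit date format parsed in this Notebook


def sustainability_indicators(first_commit, last_commit, now):
    """Return the lifespan in days and whether the last commit is at most a year old."""
    first = datetime.strptime(first_commit, TIMESTAMP_FORMAT)
    last = datetime.strptime(last_commit, TIMESTAMP_FORMAT)
    return {
        'lifespan': (last - first).days,
        'live': last >= now - timedelta(days=365),
    }


active = sustainability_indicators('2019-03-01T12:00:00Z',
                                   '2021-03-01T12:00:00Z',
                                   now=datetime(2021, 9, 1))
dormant = sustainability_indicators('2019-03-01T12:00:00Z',
                                    '2021-03-01T12:00:00Z',
                                    now=datetime(2023, 1, 1))
```

Passing `now` as an argument keeps the result reproducible; the same repository counts as live in September 2021 but no longer in January 2023.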

In [None]:
if 'commits' in params['rsidentifier']:

    total = rs_repo_table.get_number_of_entries({'first_commit':{"$exists" : False}})
    remaining_requests = -1
    counter = 0
    print('Started harvesting first commit dates...')

    while True:

        # progress indicator
        counter = counter + 1
        if counter % 25 == 0 or counter == total:
            clear_output(wait=True)
            print("processed {0} of {1}".format(counter, total))

        repo = rs_repo_table.get_entry({'first_commit':{"$exists" : False}})
        if not repo:
            break

        current = getattr(
            sys.modules[__name__],
            params['supported_sources'][repo['source']]['class'])(params['authentication'][repo['source']])
        reject, first_commit, last_commit, remaining_requests = current.get_first_commit(
            repo['full_name'],
            remaining_requests)
        if reject:
            rs_repo_table.remove_entry({'full_name':repo['full_name']})
            continue

        first = datetime.strptime(first_commit, '%Y-%m-%dT%H:%M:%SZ')
        last = datetime.strptime(last_commit, '%Y-%m-%dT%H:%M:%SZ')
        lifespan = (last - first).days
        dt = (datetime.now()-relativedelta(years=1))
        live = last >= dt

        post ={"$set" : {
            "first_commit": first_commit,
            'last_commit': last_commit,
            "live": live,
            "lifespan": lifespan
            }}
        rs_repo_table.mod_entry({'_id': repo['_id']}, post)
        time.sleep(current.get_core_sleep_time())

**Request DOI Metadata**  
To determine the research area of a repository, the subject of its associated publications has to be identified. For publications, the DOI metadata contain an ISSN, by which the subject of the journal, book, or conference proceedings can be looked up in the next step. Here, too, the regular expression does not always extract the correct DOI. If the Crossref API returns a 404, trailing non-alphanumeric characters are cut off and the DOI is checked again. 
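
Before a DOI is truncated, the cell below also asks the doi.org handle service whether the DOI is an alias of another one. The following sketch shows only the parsing step; the record is a made-up example whose shape is assumed from the fields that the cell accesses (`values`, `type`, `data`, `value`), and `alias_from_handle_record` is a name introduced here.

```python
def alias_from_handle_record(record):
    """Return the first HS_ALIAS value of a handle record, or None if there is none."""
    for value in record.get('values', []):
        if value.get('type') == 'HS_ALIAS':
            return value['data']['value']
    return None


# Hypothetical handle record with one alias entry.
record = {'values': [
    {'type': 'HS_ADMIN', 'data': {'value': {}}},
    {'type': 'HS_ALIAS', 'data': {'value': '10.1000/new-name'}},
]}
alias = alias_from_handle_record(record)
no_alias = alias_from_handle_record({'values': []})
```

If an alias is found, the metadata request is repeated with the alias instead of the original DOI.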

In [None]:
if 'crossref' in params['rsidentifier']:

    if 'crossref' in params['authentication']:
        header = params['authentication']['crossref']
    else:
        header = None

    total = rs_publication_table.get_number_of_entries(
        {'$and': [
            {'identifier.mode': 'doi'}, 
            {'checked_doi': {'$exists': False}}]})
    remaining_requests = -1
    counter = 0
    print('Started gathering DOI metadata...')

    while True:
        try:
            pub = rs_publication_table.get_entry(
                {'$and': [
                    {'identifier.mode': 'doi'},
                    {'checked_doi': {'$exists': False}}]})
        except Exception:
            # reconnect with the same collection class that was used at start-up
            rs_publication_table = db.RsArtifactCollection()
            print("DB reconnect ...")
            continue

        if not pub:
            break

        # progress indicator
        counter = counter + 1
        if counter % 50 == 0 or counter == total:
            clear_output(wait=True)
            print("processed {0} of {1}".format(counter, total))

        present_in_db = False

        doi = pub['identifier']['id']
        last_doi = ''
        # check whether metadata for the given DOI can be gathered;
        # for responses other than 200, the DOI name is truncated
        # if it does not end with an alphanumeric char
        while doi:
            call = 'https://api.crossref.org/works/' + doi
            response = requests.get(call, headers=header)
            
            # response from load balancer when the service is under heavy load
            if response.status_code in [503, 504]:
                time.sleep(60)
                continue
               
            if response.status_code == 200:
                break
            
            # check whether an alias exists
            query = 'https://doi.org/api/handles/' + doi
            reply = requests.get(query)
            if reply.status_code == 200:
                alias = [elem['data']['value'] 
                         for elem in reply.json()['values'] 
                         if elem['type'] == 'HS_ALIAS']
                if alias and last_doi != alias[0]:
                    last_doi = doi
                    doi = alias[0]
                    continue
            
            doi = aux.check_name_suffix(doi)
            if rs_publication_table.get_entry({'identifier.id':doi}):
                present_in_db = True
                break

        # the truncated version of the DOI is already in the database table
        if present_in_db:
            rs_publication_table.remove_entry({'identifier.id': pub['identifier']['id']})
            ident = None
            update_repos = True

        # metadata are gathered, if DOI is truncated, it is updated in the database tables
        elif response.status_code == 200:
            rs_publication_table.mod_entry({'_id': pub['_id']}, {'$set': response.json()['message']})
            rs_publication_table.mod_entry({'_id': pub['_id']}, {'$set': {'checked_doi': True}})
            update_repos = False
            if doi != pub['identifier']['id']:
                rs_publication_table.mod_entry({'_id': pub['_id']}, {'$set': {'identifier.id': doi}})
                ident = {'id': doi, 'mode': 'doi'}
                update_repos = True
        else:
            rs_publication_table.mod_entry({'_id': pub['_id']}, {'$set': {'checked_doi': True}})
            update_repos = False
        
        if update_repos:
            # remove unidentifiable DOI in the repositories reference list
            # and replace it with an arxiv id or a title, if available
            for repo in rs_repo_table.get_entries({ 'references.id': {'$eq' : pub['identifier']['id']}}):
                rs_repo_table.update_doi(repo['_id'], pub['identifier']['id'], ident)

**Journal Subject**   
For publications with a given DOI, the journal subject is determined via
the ISSN and the Journal_Subject database table.
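
The lookup order (ISSN first, then ISBN, then the container title) can be sketched without a database. The field names follow the queries in the cell below; the subject entry, the ISSNs, and the helper names `subject_queries` and `find_subject` are made up for illustration.

```python
def subject_queries(pub):
    """Build lookup queries from the first identifier type present in pub."""
    if 'ISSN' in pub:
        issns = [issn.replace('-', '') for issn in pub['ISSN']]
        return [{'$or': [{'print_issn': issn}, {'e_issn': issn}]} for issn in issns]
    if 'ISBN' in pub:
        return [{'$or': [{'print_isbn': isbn}, {'e_isbn': isbn}]}
                for isbn in pub['ISBN']]
    if 'container-title' in pub:
        return [{'$or': [{'title': title}, {'conference_name': title}]}
                for title in pub['container-title']]
    return []


def find_subject(query, subjects):
    """Tiny in-memory stand-in for a MongoDB $or lookup on exact field values."""
    for entry in subjects:
        if any(entry.get(key) == value
               for clause in query['$or']
               for key, value in clause.items()):
            return entry
    return None


# Hypothetical subject table entry and publication metadata.
subjects = [{'print_issn': '12345679', 'e_issn': '87654321',
             'supergroup': 'Physical Sciences'}]
pub = {'ISSN': ['8765-4321'], 'container-title': ['Example Journal']}
matches = [find_subject(query, subjects) for query in subject_queries(pub)]
```

Because an ISSN is present, the container title is not used at all; the hyphen is removed so that the ISSN matches the format stored in the subject table.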

In [None]:
if 'subject' in params['rsidentifier']:
    
    publication_subjects_table = db.Collection('publication_subjects')

    total = rs_repo_table.get_number_of_entries(
        {'$and':
         [{'checked_subject': {'$exists': False}},
          {'references.mode':'doi'}]})
    remaining_requests = -1
    counter = 0
    not_found = 0
    print('Started looking up referenced publication subject...')


    while True:        
        repo = rs_repo_table.get_entry({'$and':
                        [{'checked_subject': {'$exists': False}},
                         {'references.mode':'doi'}]})
        if not repo:
            break

        # progress indicator
        counter = counter + 1
        if counter % 100 == 0 or counter == total:
            clear_output(wait=True)
            print("processed {0} of {1}".format(counter, total))

        for ref in repo['references']:
            pub = rs_publication_table.get_entry({'identifier.id':ref['id']})
            queries = []
            
            if pub:
                if 'ISSN' in pub:
                    for issn in pub['ISSN']:                        
                        queries.append({'$or':
                                        [{'print_issn': issn.replace('-','')},
                                         {'e_issn':issn.replace('-','')}]})

                elif 'ISBN' in pub:
                    for isbn in pub['ISBN']:
                        queries.append({'$or':
                                        [{'print_isbn': isbn},
                                         {'e_isbn':isbn}]})

                elif 'container-title' in pub:
                    for title in pub['container-title']:
                        queries.append({'$or':
                             [{'title': title},
                              {'conference_name':title}]})
                if queries:
                    for query in queries:
                        subject = publication_subjects_table.get_entry(query)
                        if subject:
                            rs_publication_table.mod_entry({'_id':pub['_id']},
                                                           {'$set':
                                                            {'sub_subject': subject['subgroups'] if 'subgroups' in subject else [None],
                                                             'subject_asjc': subject['groups'],
                                                             'main_subject': subject['supergroup']
                                                            }})
                            rs_repo_table.save_subject(repo['_id'], subject)
        rs_repo_table.mod_entry({'_id': repo['_id']}, {'$set': {'checked_subject': True}})

**arXiv Subjects**   
arXiv provides its own category taxonomy and returns the primary category of each publication with the search results.

In [None]:
if 'subject_arxiv' in params['rsidentifier']:
    
    arxiv_subjects_table = db.Collection('arxiv_subjects')

    total = rs_repo_table.get_number_of_entries(
        {'$and':[{'group':{'$in':['arxiv']}},
                 {'main_subject':{'$exists':False}}]})
    remaining_requests = -1
    counter = 0
    not_found = 0
    print('Started looking up arxiv publication subjects ...')

    for repo in rs_repo_table.get_entries(
        {'$and':[{'group':{'$in':['arxiv']}},
                 {'main_subject':{'$exists':False}}]}): 

        # progress indicator
        counter = counter + 1
        if counter % 100 == 0 or counter == total:
            clear_output(wait=True)
            print("processed {0} of {1}".format(counter, total))
        for ref in repo['references']:
            
            query = {'doi':ref['id']} if ref['mode'] == 'doi' else {'arxiv_id':ref['id']}

            pub = publication_table.get_entry(query)

            if pub and 'primary_category' in pub:
                subject = arxiv_subjects_table.get_entry({'short': pub['primary_category']})
                if subject:
                    rs_repo_table.save_subject(repo['_id'], subject)