***Mining Research Publications***    
*Goal*: gather publications containing a specified keyword  
*Supported digital libraries and e-print repositories*: arXiv

*Default Parameters in the configuration file*:  
  - pub_sources: arxiv
  - pub_keywords: github.com

In [None]:
import sys
import yaml
from modules.database import Collection
from modules.arxiv_harvester import ArXivHarvester

**API for digital libraries and e-print repositories**  
The parameters required for each service to be harvested are looked up in the supported sources dictionary. Each entry records the name of the harvester class and whether an authentication token is required.

In [None]:
supported_sources = {
    'arxiv' : {
        'token' : False,
        'class' : 'ArXivHarvester'
    }
}

**Load Required Parameters**  
All necessary parameters for the publication harvesting process are specified in the associated configuration file, located in the same folder. The specified publication sources are checked against the supported sources, and a notification is printed for each unsupported source that is skipped. Sources that require an authentication token but have no token specified are skipped as well. The indicated authentication tokens are stored in the corresponding dictionary entry. A MongoDB database stores the metadata and additional information of the harvested publications. If the given database table does not exist, the user must confirm whether a new database table with this name should be created, or specify an alternative database table.

In [None]:
sources = []
with open("config.yaml", 'r') as stream:
    params = yaml.safe_load(stream)

for param in params['pub_sources']:
    if param not in supported_sources:
        print("excluded, as not supported: ", param)
    # 'token' is a boolean flag; a missing or empty token counts as not specified
    elif supported_sources[param]['token'] and not (params.get('authentication') or {}).get(param):
        print("excluded, as token is needed: ", param)
    else:
        sources.append(param)

# initialize database
# check if database table exists
publication_collection = Collection('publications')

# keywords
keywords = params['pub_keywords']
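
The source checks described above can be sketched as a standalone function that works on an already-parsed configuration, which makes the skip rules easy to try without a configuration file or a database. The function name `select_sources`, the `tokens` mapping, and the `example_lib` entry are hypothetical and serve only as illustration.

```python
def select_sources(requested, supported, tokens=None):
    """Split requested sources into accepted ones and skipped ones with a reason."""
    tokens = tokens or {}
    accepted, skipped = [], {}
    for name in requested:
        if name not in supported:
            skipped[name] = "not supported"
        elif supported[name]["token"] and not tokens.get(name):
            # The source needs a token, but none (or an empty one) was given.
            skipped[name] = "token is needed"
        else:
            accepted.append(name)
    return accepted, skipped


# Hypothetical registry: 'example_lib' is an invented entry that requires a token.
supported = {
    "arxiv": {"token": False, "class": "ArXivHarvester"},
    "example_lib": {"token": True, "class": "ExampleHarvester"},
}

accepted, skipped = select_sources(["arxiv", "example_lib", "unknown"], supported)
print(accepted)
print(skipped)
```

Returning the skipped sources together with a reason, instead of only printing them, lets the caller decide how to report them.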

**Harvesting publications**  
For every supported digital library and e-print repository specified in the configuration file, the corresponding harvester class is instantiated. Its harvest method is then called once per keyword, with the keyword and the MongoDB collection instance for publications.

In [None]:
for source in sources:
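    # look up the harvester class by name among the names imported into this notebook and instantiate it
    current = getattr(sys.modules[__name__], supported_sources[source]['class'])()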
    for keyword in keywords:
        current.harvest(keyword, publication_collection)
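
The dispatch step can be tried without arXiv or MongoDB by substituting stand-ins. In this sketch, `DummyHarvester`, the `dummy` source, the second keyword, and the list used as a collection are hypothetical placeholders. The class is resolved through an explicit dictionary instead of `sys.modules`, which is an alternative to the lookup above and not the notebook's own approach.

```python
class DummyHarvester:
    """Hypothetical stand-in for a harvester class such as ArXivHarvester."""

    def harvest(self, keyword, collection):
        # A real harvester would query the source; this only records the call.
        collection.append({"source": "dummy", "keyword": keyword})


# Explicit name-to-class mapping (assumption: replaces the sys.modules lookup).
harvester_classes = {"DummyHarvester": DummyHarvester}
supported = {"dummy": {"token": False, "class": "DummyHarvester"}}

records = []  # stand-in for the MongoDB collection
for source in ["dummy"]:
    harvester = harvester_classes[supported[source]["class"]]()
    for keyword in ["github.com", "gitlab.com"]:
        harvester.harvest(keyword, records)

print(len(records))  # 2, one record per keyword
```

With an explicit mapping, a registry entry without a matching class fails with a `KeyError` that names the missing class, which may be easier to diagnose than an `AttributeError` from the module lookup.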