# Dataset Description   
This notebook describes the main characteristics of the harvested software repositories, the harvested publications, and the resulting set of research software repositories.  

In [None]:
import matplotlib.pyplot as plt
import collections
import modules.database as db

## Connect to Database Collections  
The harvested GitHub repositories and research publications are stored in MongoDB
collections. In addition, collections are created for the identified research software
repositories and their corresponding publications. Repositories and publications each get
their own collection. To link publications and repositories, each repository has a
list of DOI names and each publication has a list of repository names. If a
collection specified in the config file does not exist, you are asked to confirm
whether a new collection with this name should be created or to specify an alternative
collection.
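
The linking scheme described above can be illustrated with a small in-memory sketch. The field names `dois` and `repositories`, the repository names, and the DOIs are hypothetical and chosen only for illustration; the actual schema is defined in `modules.database`.

```python
# Hypothetical documents; the field names 'dois' and 'repositories' are assumptions.
repositories = [
    {"name": "alice/solver", "dois": ["10.1000/example.1", "10.1000/example.2"]},
    {"name": "bob/parser", "dois": ["10.1000/example.2"]},
]
publications = [
    {"doi": "10.1000/example.1", "repositories": ["alice/solver"]},
    {"doi": "10.1000/example.2", "repositories": ["alice/solver", "bob/parser"]},
]


def links_from_repositories(repos):
    """Collect (repository name, DOI) pairs as seen from the repository side."""
    return {(repo["name"], doi) for repo in repos for doi in repo["dois"]}


def links_from_publications(pubs):
    """Collect (repository name, DOI) pairs as seen from the publication side."""
    return {(name, pub["doi"]) for pub in pubs for name in pub["repositories"]}


# Both directions should describe the same set of links.
links_consistent = links_from_repositories(repositories) == links_from_publications(publications)
print(links_consistent)
```

Storing the link on both sides allows a lookup in either direction without a join, at the cost of keeping the two lists in sync.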

In [None]:
repo_table = db.RepoCollection()
publication_table = db.Collection('publications')
rs_repo_table = db.RsRepoCollection()
rs_artifact_table = db.RsArtifactCollection()

## Set Basic Parameters for Analysis

In [None]:
plt.style.use('fivethirtyeight')
plt.rcParams.update({'font.size': 14})

## Auxiliary Functions    
Except for the **percentage** function, the auxiliary functions take a database query and a description of the selected characteristic as parameters. Based on the query, the number of entries (publications, artifacts, or repositories) with the specific characteristic is requested and reported relative to the overall sample. The extracted research artifacts, stored in the **rs_artifacts** collection, are also considered relative to all artifacts identified by a DOI and to all artifacts with harvested metadata.
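
As a rough illustration of the three reference samples (overall, DOI, metadata), the following sketch reproduces the idea on an in-memory list. The artifacts are hypothetical, and plain Python predicates stand in for the MongoDB queries; it is not the notebook's database code.

```python
def percentage(part, whole):
    """Share of part in whole, in percent, rounded to two decimals."""
    return round(100 * float(part) / float(whole), 2)


# Hypothetical artifacts with only the fields needed to form the three samples.
artifacts = [
    {"identifier": {"mode": "doi"}, "source": "Crossref", "ISSN": ["1234-5678"]},
    {"identifier": {"mode": "doi"}, "source": "Crossref"},
    {"identifier": {"mode": "doi"}},
    {"identifier": {"mode": "arxiv_id"}},
]


def count(predicate):
    """Number of artifacts for which the predicate holds."""
    return sum(1 for artifact in artifacts if predicate(artifact))


def shares(predicate):
    """Relate the matching artifacts to the overall, DOI, and metadata samples."""
    request = count(predicate)
    total = len(artifacts)
    doi = count(lambda a: a["identifier"]["mode"] == "doi")
    meta = count(lambda a: a["identifier"]["mode"] == "doi" and a.get("source") == "Crossref")
    return {
        "count": request,
        "overall": percentage(request, total),
        "doi": percentage(request, doi),
        "metadata": percentage(request, meta),
    }


issn_shares = shares(lambda a: "ISSN" in a)
print(issn_shares)
```

Note that `percentage` raises a `ZeroDivisionError` when the reference sample is empty, so the sketch assumes non-empty samples.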

In [None]:
def percentage(part, whole):
    return round(100 * float(part)/float(whole), 2)

def publication_desc(query, desc):
    total = publication_table.get_number_of_entries({})
    request = publication_table.get_number_of_entries(query)
    print(desc, request, '(', percentage(request, total),'%)')

def artifact_desc(query, desc):
    total = rs_artifact_table.get_number_of_entries({})
    doi = rs_artifact_table.get_number_of_entries({'identifier.mode':'doi'})
    meta = rs_artifact_table.get_number_of_entries({'$and':[{'identifier.mode':'doi'},{'source':'Crossref'}]})
    request = rs_artifact_table.get_number_of_entries(query)
    print(desc, request, '(Overall sample: ', percentage(request, total),'%)', '(DOI sample: ', percentage(request, doi),'%)', '(Metadata sample: ', percentage(request, meta),'%)')

def rs_repo_desc(query, desc):
    total = rs_repo_table.get_number_of_entries({})
    request = rs_repo_table.get_number_of_entries(query)
    print(desc, request, '(', percentage(request, total),'%)')

## Harvested Publications
In total, 33,187 publications were harvested, 24,240 from ACM (73.04%) and 8,947 from arXiv (26.96%). The DOI is the preferred identifier for the publications. It is provided for most of the publications (78.61%). The remaining publications are identified by their arXiv ID.
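
The reported shares can be checked with the same rounding as the **percentage** function. This is only an arithmetic check on the figures stated above and does not query the database.

```python
def percentage(part, whole):
    """Share of part in whole, in percent, rounded to two decimals."""
    return round(100 * float(part) / float(whole), 2)


acm = 24240    # publications harvested from ACM
arxiv = 8947   # publications harvested from arXiv
total = acm + arxiv

acm_share = percentage(acm, total)
arxiv_share = percentage(arxiv, total)
print(total, acm_share, arxiv_share)
```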

In [None]:
publication_desc({}, 'Total number of harvested publications: ')
publication_desc({'source':'acm'}, 'Number of harvested ACM publications: ')
publication_desc({'source':'arxiv'}, 'Number of harvested arXiv publications: ')
publication_desc({'$and':[{'source':'acm'},{'$or':[{'doi':{'$ne':''}},{'doi_from_link':{'$ne':''}}]}]}, 
                 'Number of ACM publications with a DOI: ')
publication_desc({'$and':[{'source':'arxiv'},{'doi':{'$ne':''}}]}, 
                 'Number of arXiv publications with a DOI: ')
publication_desc({'$or':[{'doi':{'$ne':''}},{'doi_from_link':{'$ne':None}}]}, 'Number of publications with a DOI: ')

## Harvested Research Software Artifacts

### Metadata and ISSN

Together with the DOIs extracted from the GitHub repositories, the harvested identifiers sum up to 118,161 referenced publications. If a DOI exists, the corresponding metadata are gathered. The artifacts identified by a DOI form the DOI subsample. The Metadata subsample refines it and consists of all artifacts that are identified by a DOI and have harvested metadata. The metadata requests were successful for most of the publications (83.22%). Of the metadata sample, 94.37% have an assigned ISSN or ISBN, which corresponds to 78.54% of the DOI sample. Based on the ISSN and ISBN, the research domain of a publication, and thus of a research software repository, is derived. A research domain could be assigned to half of the publications. 
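
The two ISSN/ISBN shares can be related to each other with a short calculation. The sketch assumes that the 83.22% success rate refers to the DOI sample and that every artifact with an ISSN or ISBN belongs to the metadata sample; both are assumptions, not statements from the database.

```python
metadata_success = 83.22      # % of the DOI sample with harvested metadata (assumed base)
issn_in_metadata = 94.37      # % of the metadata sample with an ISSN or ISBN
issn_in_doi_reported = 78.54  # % of the DOI sample, as reported in the text

# Share in the DOI sample = share in the metadata sample x metadata success rate.
issn_in_doi_derived = round(metadata_success * issn_in_metadata / 100, 2)
difference = round(abs(issn_in_doi_reported - issn_in_doi_derived), 2)
print(issn_in_doi_derived, difference)
```

Under these assumptions the derived value lies within about 0.01 percentage points of the reported one, which is plausible given that the inputs are already rounded.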

In [None]:
print('Harvested research software artifacts: ', rs_artifact_table.get_number_of_entries({}))
print()
artifact_desc({'identifier.mode':'doi'}, 'Number of artifacts identified by a DOI: ')
artifact_desc({'identifier.mode':'arxiv_id'}, 'Number of artifacts identified by an arxiv_id: ')
artifact_desc({'identifier.mode':'title'}, 'Number of artifacts identified by a title: ')
print()
artifact_desc({'source':'Crossref'}, 'Number of artifacts with available metadata: ')
print()
artifact_desc({'$or':[{'ISSN':{'$exists':True}},{'ISBN':{'$exists':True}}]}, 'Number of artifacts with an assigned ISSN or ISBN: ')
artifact_desc({'ISSN':{'$exists':True}}, 'Number of artifacts with an assigned ISSN: ')
artifact_desc({'ISBN':{'$exists':True}}, 'Number of artifacts with an assigned ISBN: ')
artifact_desc({'main_subject':{'$exists':True}}, 'Number of artifacts with an assigned research domain: ')

### Artifact Type
The type specification is part of the available metadata. Publications with a missing type specification account for 16.8% of the DOI sample. The majority of the harvested publications are classified as journal articles, followed by proceedings articles. The aggregated category “Others” summarizes, among other types, books, datasets, journals, and monographs (see the full list in the output below). Because the collection also contains datasets, components, and other types, it is named rs_artifacts.
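
The aggregation into an “Others” category can be sketched independently of the database. The counts below are hypothetical, and `aggregate_small` is an illustrative helper, not a function of the notebook's modules.

```python
def aggregate_small(counts, threshold, other_label="others"):
    """Merge categories below the threshold (and the literal 'other') into one bucket."""
    result = {other_label: 0}
    for key, value in counts.items():
        if not key:
            # An empty or missing type is reported separately.
            result["not specified"] = value
        elif value < threshold or key == "other":
            result[other_label] += value
        else:
            result[key] = value
    # Sort by count, largest first, as done before plotting the pie chart.
    return dict(sorted(result.items(), key=lambda item: item[1], reverse=True))


# Hypothetical counts per type.
example_counts = {
    "journal-article": 5000,
    "proceedings-article": 3000,
    "book": 200,
    "dataset": 150,
    "other": 40,
    None: 1200,
}
aggregated = aggregate_small(example_counts, threshold=1000)
print(aggregated)
```

Keeping the threshold as a parameter makes it easy to see how the choice of cut-off changes the number of pie slices.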

In [None]:
overview = rs_artifact_table.compose_type()
types = {'others':0}
keys = []
for elem in overview:
    for item in elem['counts']:
        keys.append(item['k'])
        if not item['k']:
            types['not specified'] = item['v']
        elif item['v'] < 1000 or item['k'] == 'other':
            types['others'] = types['others'] + item['v']
        else:
            types[item['k']] = item['v']

print('All type specifications: ', keys, '\n')
types = {key: value for key, value in sorted(types.items(), key=lambda item: item[1],reverse=True)}

plt.pie([float(v) for v in types.values()], labels=[k for k in types.keys()],
        autopct='%1.1f%%', counterclock=False, startangle=0, pctdistance=0.8, colors=(plt.cm.Set3.colors*2))

plt.axis('equal')
# plt.savefig("doiTypes.pdf", bbox_inches = "tight")
print('Artifact types, as stated in the DOI metadata:')
plt.show()

## Harvested Research Software Repositories

From the harvested publications, 148,560 repository names were extracted. Together with the 70,227 harvested GitHub repositories, this amounts to 218,787 research software candidates. After filtering and classifying the repositories, 74,257 research software repositories were identified. These repositories are only a tiny part of the more than 50 million public repositories hosted on GitHub (https://api.github.com/search/repositories?q=is:public). Due to a negative lifespan, 14 repositories were identified as outliers and removed from the sample. More than a third of the research software repositories, referred to as repositories in the following, were set up in recent years.
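
A minimal sketch of the candidate count and of a negative-lifespan check follows. The records are hypothetical, and the field name `last_commit` as well as the plain ISO date format are assumptions; only `first_commit` appears in the cells of this notebook.

```python
from datetime import date

# Candidate count as stated in the text.
extracted_names = 148560
harvested_repos = 70227
candidates = extracted_names + harvested_repos

# Hypothetical records; 'last_commit' is an assumed field name.
repos = [
    {"name": "a/x", "first_commit": "2018-03-01", "last_commit": "2020-06-15"},
    {"name": "b/y", "first_commit": "2019-05-10", "last_commit": "2017-01-01"},
]


def lifespan_days(repo):
    """Days between the first and the last commit; negative values indicate bad data."""
    first = date.fromisoformat(repo["first_commit"])
    last = date.fromisoformat(repo["last_commit"])
    return (last - first).days


outliers = [repo["name"] for repo in repos if lifespan_days(repo) < 0]
kept = [repo["name"] for repo in repos if lifespan_days(repo) >= 0]
print(candidates, outliers, kept)
```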

In [None]:
print('Number of harvested repositories: ', repo_table.get_number_of_entries({}))
print('Classified research software repositories: ', rs_repo_table.get_number_of_entries({}))

In [None]:
repos = rs_repo_table.get_entries({})
years = collections.Counter()
for repo in repos:
    # The year is the first component of the 'first_commit' date string.
    year = repo['first_commit'].split('-')[0]
    if 2003 < int(year) < 2021:
        years[year] += 1
years = collections.OrderedDict(sorted(years.items()))
total = rs_repo_table.get_number_of_entries({})
print('Repos created in 2020: ', years['2020'], 'in percent: ', percentage(years['2020'], total))
print('Repos created in 2019: ', years['2019'], 'in percent: ', percentage(years['2019'], total))
plt.bar(range(len(years)), list(years.values()), align='center', tick_label=list(years.keys()))
plt.xticks(range(len(years)), list(years.keys()), rotation="vertical")

plt.ylabel('Repositories')
plt.xlabel('Year of First Commit')

# plt.savefig("allYear.pdf", bbox_inches = "tight")

plt.show()

### General Characteristics
Among these harvested repositories, only a few (1.54%) are archived. Nearly half of the repositories (49.86%) have been forked, whereas only a few (7.85%) are forks themselves. For slightly more than half of the repositories (54.6%), the metadata state a license. Most repositories (83.59%) have an assigned primary language. 

In [None]:
rs_repo_desc({'license':{'$ne':None}}, 'Number of repositories with a license: ')
rs_repo_desc({'archived':True}, 'Number of archived repositories: ')
rs_repo_desc({'forks':{'$gt':0}}, 'Number of forked repositories: ')
rs_repo_desc({'fork':True}, 'Number of repositories that are a fork: ')
rs_repo_desc({'language':{'$ne':None}}, 'Number of repositories with an assigned primary language: ')

### Primary Programming Language
The most commonly used programming language is Python (26.1%), followed by R (11.1%) and C++ (9.6%).

In [None]:
repos = rs_repo_table.get_entries({})
languages = {'Others': 0}
for repo in repos:
    if repo['language']:
        if repo['language'] in languages:
            languages[repo['language']] = languages[repo['language']] + 1
        else:
            languages[repo['language']] = 1
others = 0 
remove = []
for k,v in languages.items():
    if v < 2000:
        others = others + v
        remove.append(k)
for key in remove:
    del languages[key]
languages = {key: value for key, value in sorted(languages.items(), key=lambda item: item[1],reverse=True)}
languages['Others'] = others
    

plt.pie([float(v) for v in languages.values()], labels=[k for k in languages.keys()], counterclock=False,
        autopct='%1.1f%%', startangle=90, pctdistance=0.8, colors=plt.cm.Set3.colors)

plt.axis('equal')
# plt.savefig("languages.pdf", bbox_inches = "tight")
print('Distribution of assigned primary languages:')
plt.show()

### Assigned Research Artifacts
Most of the repositories (73.5%) are related to a single research publication. Two linked publications are considerably less common (17%), and very few repositories are associated with five (1%) or more (2.3%) publications.

In [None]:
repos = rs_repo_table.get_entries({})
num_artifacts = {'Others': 0}
for repo in repos:
    size = len(repo['references'])
    if size in num_artifacts:
        num_artifacts[size] = num_artifacts[size] + 1
    else:
        num_artifacts[size] = 1
        
others = 0 
remove = []
for k,v in num_artifacts.items():
    if v < 500:
        others = others + v
        remove.append(k)
for key in remove:
    del num_artifacts[key]

num_artifacts = {key: value for key, value in sorted(num_artifacts.items(), key=lambda item: item[1],reverse=True)}
num_artifacts['Others'] = others
    

plt.pie([float(v) for v in num_artifacts.values()], labels=[k for k in num_artifacts.keys()], counterclock=False,
        autopct='%1.1f%%', startangle=0, pctdistance=0.8, colors=plt.cm.Set3.colors)

plt.axis('equal')
# plt.savefig("numRef.pdf", bbox_inches = "tight")
print('Distribution of assigned research artifacts:')
plt.show()

### Type of Publication Reference

Each repository is characterized by its type of publication reference. The reference allows the allocation of a repository to one of three groups.   
(1) The GitHub group consists of repositories that are gathered from GitHub and contain a valid DOI reference.   
(2) Repositories referenced in an ACM publication are part of the ACM group.    
(3) The arXiv group comprises all repositories that are linked in arXiv publications.
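
The group counts can be sketched on in-memory data. A MongoDB `$in` query matches a field that equals the value as well as an array field that contains it, so the sketch handles both cases. Whether `group` actually holds a list or a single string is an assumption here, and the records are hypothetical.

```python
def percentage(part, whole):
    """Share of part in whole, in percent, rounded to two decimals."""
    return round(100 * float(part) / float(whole), 2)


# Hypothetical records; the shape of the 'group' field is an assumption.
repos = [
    {"name": "a/x", "group": ["github", "arxiv"]},
    {"name": "b/y", "group": ["acm"]},
    {"name": "c/z", "group": "arxiv"},
]


def in_group(repo, group):
    """Mimic {'group': {'$in': [group]}} for scalar and list-valued fields."""
    value = repo.get("group")
    if isinstance(value, list):
        return group in value
    return value == group


group_counts = {g: sum(in_group(r, g) for r in repos) for g in ["github", "acm", "arxiv"]}
group_shares = {g: percentage(c, len(repos)) for g, c in group_counts.items()}
print(group_counts, group_shares)
```

If a repository can belong to more than one group, as in the first hypothetical record, the shares add up to more than 100%.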

In [None]:
total = rs_repo_table.get_number_of_entries({})
groups = {}
category = ['github', 'acm', 'arxiv']

for elem in category:
    part = rs_repo_table.get_number_of_entries({ "group": { '$in': [elem] } })
    groups[elem] = part
    print(percentage(part, total), '% (in total',
          part, ') number of repositories in the', elem, 'set')

plt.bar(range(len(groups)), list(groups.values()), align='center', tick_label=list(groups.keys()))
plt.xticks(range(len(groups)), list(groups.keys()))
plt.xlabel('Type of publication reference')
plt.ylabel('Repositories')
# plt.savefig("group.pdf", bbox_inches = "tight")

plt.show()