# Classification of Research Software Repositories

Using the ISSN and ISBN of the related publications, the research domain of a research software repository is looked up in the Scopus source list, the Scopus book title list, and the All Science Journal Classification (ASJC) codes of Scopus (https://www.scopus.com). The ASJC defines four top-level research subjects: life sciences, social sciences, physical sciences, and health sciences. These are subdivided into 26 research fields. One additional research field is called ‘Multidisciplinary’.    
Since no ISSN or ISBN is available for arXiv publications, the arXiv taxonomy (https://arxiv.org/category_taxonomy) is used to assign a research field to the software repositories they reference. The corresponding research subject is obtained by mapping the arXiv categories to ASJC categories.
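
The arXiv branch of this lookup can be sketched as follows. This is a minimal illustration under stated assumptions, not the mapping used for the study: the tables `ARXIV_ARCHIVE_TO_FIELD` and `ARXIV_FIELD_TO_SUBJECT` and the function `classify_arxiv_category` are hypothetical and contain only a few example entries.

```python
# Hypothetical excerpt: arXiv archive prefix -> research field name.
ARXIV_ARCHIVE_TO_FIELD = {
    "cs": "Computer Science",
    "math": "Mathematics",
    "astro-ph": "Astrophysics",
    "q-bio": "Quantitative Biology",
    "q-fin": "Quantitative Finance",
}

# Assumed assignment of these fields to the ASJC top-level research subjects.
ARXIV_FIELD_TO_SUBJECT = {
    "Computer Science": "Physical Sciences",
    "Mathematics": "Physical Sciences",
    "Astrophysics": "Physical Sciences",
    "Quantitative Biology": "Life Sciences",
    "Quantitative Finance": "Social Sciences",
}


def classify_arxiv_category(category):
    """Return (research field, research subject) for an arXiv category such as 'cs.CV'."""
    # The archive is the part before the first dot, e.g. 'cs.CV' -> 'cs'.
    archive = category.split(".")[0]
    field = ARXIV_ARCHIVE_TO_FIELD.get(archive)
    # Unknown archives yield (None, None) instead of raising an error.
    subject = ARXIV_FIELD_TO_SUBJECT.get(field)
    return field, subject


print(classify_arxiv_category("cs.CV"))
print(classify_arxiv_category("q-bio.GN"))
```

Returning `None` for unknown categories keeps the sketch simple; a real pipeline would more likely log or collect unmapped categories for manual review.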

In [None]:
import matplotlib.pyplot as plt
import pandas
import collections
import modules.database as db

## Connect to Database Collections 

In [None]:
repo_table = db.RepoCollection()
publication_table = db.Collection('publications')
rs_repo_table = db.RsRepoCollection()
rs_artifact_table = db.RsArtifactCollection()

## Set Basic Parameters for Analysis

In [None]:
plt.style.use('fivethirtyeight')
plt.rcParams.update({'font.size': 16})

In [None]:
labelsSubject = ["Health Sciences", "Social Sciences", "Physical Sciences", "Life Sciences", "Multidisciplinary", "Interdisciplinary"]
labelsFields = ["Interdisciplinary", "Business, Management and Accounting", "Economics, Econometrics and Finance",
                "Engineering", "Materials Science", "Chemical Engineering", "Medicine", "Arts and Humanities",
                "Energy", "Social Sciences", "Agricultural and Biological Sciences", 
                "Physics and Astronomy", "Earth and Planetary Sciences", 'Others',
                "Mathematics", "Pharmacology, Toxicology and Pharmaceutics", 
                "Computer Science", "Environmental Science", "Dentistry", "Nursing", "Health Professions", 
                "Immunology and Microbiology", "Neuroscience", "Veterinary", "Psychology", "Decision Sciences", 
                "Multidisciplinary", "Astrophysics", "Mathematical Physics","Electrical Engineering and Systems Science", 
                "Condensed Matter", "Biochemistry, Genetics and Molecular Biology",
                "Condensed Matter","High Energy Physics - Phenomenology", "Quantitative Biology", "Quantum Physics",
                "General Relativity and Quantum Cosmology","Physics", "Nuclear Theory", "Quantitative Finance",
                "High Energy Physics - Theory","High Energy Physics - Lattice","High Energy Physics - Experiment",
                "Economics", "Nonlinear Sciences", 'Statistics', "Chemistry"]

coloursSubject = dict(zip(labelsSubject, plt.cm.Set2.colors[:len(labelsSubject)]))
coloursField = dict(zip(labelsFields, (plt.cm.Set3.colors*4)[:len(labelsFields)]))

## Auxiliary Functions
The **cursor_to_dict** function converts a PyMongo cursor into a dictionary that maps each research subject or research field to its number of repositories. Optionally, a threshold can be passed, so that all entries below this threshold are summarized in the category 'Others'. The function returns both the complete dictionary and the summarized one.        
The **display_table** function displays a dictionary with research subject or research field information as a table and adds the percentage for each entry. Besides the dictionary (data), a flag is passed to indicate whether the dictionary contains research subjects (subject = True) or research fields (subject = False).       
The **display_pie_chart** function displays a dictionary with research subject or research field information as a pie chart. It takes a dictionary (data), the sample name (sample = Overall | GitHub | ACM | arXiv), and the subject flag (True if research subjects should be displayed). The file name for storing the pie chart as a PDF is an optional parameter. 
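
The threshold and percentage logic described above can be tried out independently of the database. The following is a minimal sketch on a plain dictionary; the helper names `group_small_categories` and `with_percentages` and the counts in `example` are made up for illustration only.

```python
def group_small_categories(counts, limit):
    """Sum all entries below `limit` under the key 'Others'."""
    grouped = {}
    others = 0
    for name, value in counts.items():
        if value < limit:
            others += value
        else:
            grouped[name] = value
    grouped["Others"] = others
    return grouped


def with_percentages(counts):
    """Return the share of each entry in percent, rounded to two decimals."""
    total = sum(counts.values())
    return {name: round(100 * value / total, 2) for name, value in counts.items()}


# Made-up counts, not taken from the study.
example = {"Computer Science": 60, "Mathematics": 25, "Energy": 10, "Dentistry": 5}

grouped = group_small_categories(example, limit=20)
print(grouped)
print(with_percentages(grouped))
```

Grouping small categories keeps the pie chart readable, while the ungrouped dictionary remains available for the detailed table.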

In [None]:
def percentage(part, whole):
    return round(100 * float(part)/float(whole), 2)

def cursor_to_dict(cursor, limit=None):
    # Returns (areas, areas_cumulated): the counts per research area and the
    # same counts with all areas below `limit` summed up under 'Others'.
    areas = {}
    for elem in cursor:
        for tmp in elem['counts']:
            # Entries without any key are skipped.
            if not tmp['k']:
                continue
            for key in tmp['k']:
                # An empty key is counted as 'Multidisciplinary'.
                name = key if key else 'Multidisciplinary'
                # Accumulate, so that an area occurring in several key lists is not overwritten.
                areas[name] = areas.get(name, 0) + tmp['v']
    # Sort by count in descending order.
    areas = dict(sorted(areas.items(), key=lambda item: item[1], reverse=True))
    if limit:
        areas_cumulated = {k: v for k, v in areas.items() if v >= limit}
        areas_cumulated['Others'] = sum(v for v in areas.values() if v < limit)
    else:
        areas_cumulated = areas
    return areas, areas_cumulated

def display_table(data, subject):
    composed_data = []
    total = sum(data.values())
    spec = 'Subject' if subject else 'Field'
    for key, value in data.items():
        composed_data.append([key, value, percentage(value, total)])
    print(pandas.DataFrame(composed_data, columns=["Research " + spec, "Number of Repos", 'Percentage']))

def display_pie_chart(data, sample, subject, saveChartName=None):
    color_dict = coloursSubject if subject else coloursField
    plt.pie([float(v) for v in data.values()], labels=list(data.keys()),
            autopct='%1.1f%%', startangle=90, pctdistance=0.75, counterclock=False,
            colors=[color_dict[key] for key in data])
    plt.axis('equal')
    if saveChartName:
        plt.savefig(saveChartName+".pdf", bbox_inches = "tight")
    spec = 'subjects' if subject else 'fields'
    print('Research ' + spec + ' of research software repositories in the ' + sample + ' sample: ')
    plt.show()

## Assigned Research Domains
More than half of the repositories have two or more assigned research fields, and almost a third have two or more assigned research subjects. In these cases, each assigned research subject or research field is counted in the classification.
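
A consequence of this counting rule is that the category totals add up to more than the number of repositories. The following minimal sketch shows the effect on three made-up repository records; only the field name `main_subject` is taken from the notebook, everything else is illustrative.

```python
from collections import Counter

# Made-up repository records for illustration.
repos = [
    {"name": "repo-a", "main_subject": ["Physical Sciences"]},
    {"name": "repo-b", "main_subject": ["Physical Sciences", "Life Sciences"]},
    {"name": "repo-c", "main_subject": ["Health Sciences", "Life Sciences"]},
]

# Every assigned subject is counted, so repo-b and repo-c each contribute twice.
subject_counts = Counter(s for repo in repos for s in repo["main_subject"])

# Repositories with more than one assigned subject.
multi_assigned = sum(1 for repo in repos if len(repo["main_subject"]) > 1)

print(dict(subject_counts))
print("Assignments:", sum(subject_counts.values()), "Repositories:", len(repos))
print("Repositories with several subjects:", multi_assigned)
```

Percentages derived from such counts therefore refer to assignments rather than to distinct repositories.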

In [None]:
repos = rs_repo_table.get_entries({})
numSubjects = {}
numFields = {}

for repo in repos:
    sub = len(repo['main_subject'])
    field = len(repo['subject'])
    if sub in numSubjects:
        numSubjects[sub] = numSubjects[sub] + 1
    else:
        numSubjects[sub] = 1
    if field in numFields:
        numFields[field] = numFields[field] + 1
    else:
        numFields[field] = 1
numSubjects = collections.OrderedDict(sorted(numSubjects.items()))
numFields = collections.OrderedDict(sorted(numFields.items()))
# Restrict the chart to repositories with fewer than nine assigned fields.
numFields = {k: numFields[k] for k in numFields.keys() if k < 9}

# The x-axis ticks are labelled with the number of assigned subjects.
plt.bar(range(len(numSubjects)), list(numSubjects.values()), align='center')
plt.xticks(range(len(numSubjects)), list(numSubjects.keys()))
plt.xlabel('Number of Assigned Subjects')
plt.ylabel('Repositories')
# plt.savefig("NumSubjects.pdf", bbox_inches = "tight")
plt.show()

plt.bar(range(len(numFields)), list(numFields.values()), align='center')
plt.xticks(range(len(numFields)), list(numFields.keys()))
plt.xlabel('Number of Assigned Fields')
plt.ylabel('Repositories')
# plt.savefig("numFields.pdf", bbox_inches = "tight")
plt.show()

In [None]:
print('Number of repositories with more than one assigned research subject: ', 
      rs_repo_table.get_number_of_entries({'main_subject.1':{'$exists':True}}), 
      'in percent: ',
      percentage(rs_repo_table.get_number_of_entries({'main_subject.1':{'$exists':True}}), rs_repo_table.get_number_of_entries({})))
print('Number of repositories with more than one assigned research field: ', 
      rs_repo_table.get_number_of_entries({'subject.1':{'$exists':True}}),
      'in percent: ',
      percentage(rs_repo_table.get_number_of_entries({'subject.1':{'$exists':True}}), rs_repo_table.get_number_of_entries({})))

### Research Subject Distribution for all Repositories
The majority of the repositories belong to the physical sciences, followed by repositories assigned to the life sciences.

In [None]:
overview = rs_repo_table.compose_subjects('main_subject')
data, data_cumulated = cursor_to_dict(overview)
display_pie_chart(data_cumulated, 'overall', True)
display_table(data, True)

### Research Field Distribution for all Repositories
With 39.9% of the repositories, computer science is the most strongly represented research field, followed by biochemistry, genetics and molecular biology (9.2%). The remaining repositories are spread over various research fields, such as mathematics (6.7%) and engineering (6.1%), and further research fields that are accumulated in the Others category.

In [None]:
overview = rs_repo_table.compose_subjects('subject')
data, data_cumulated = cursor_to_dict(overview, 2800)
display_pie_chart(data_cumulated, 'overall', False)

display_table(data, False)

### Research Subject Distribution of the Repositories in the GitHub Sample
A closer look into the GitHub sample reveals that nearly half of the repositories are assigned to physical sciences, followed by the life and health sciences.

In [None]:
overview = rs_repo_table.compose_subjects('main_subject','github')
data, data_cumulated = cursor_to_dict(overview)
display_pie_chart(data_cumulated, 'GitHub', True)

display_table(data, True)

### Research Field Distribution of the Repositories in the GitHub Sample
Publications belonging to computer science (21.6%) and to biochemistry, genetics and molecular biology (13.8%) are referenced slightly more often than publications of other research fields.

In [None]:
overview = rs_repo_table.compose_subjects('subject', 'github')
data, data_cumulated = cursor_to_dict(overview, 2200)
display_pie_chart(data_cumulated, 'GitHub', False)

# display_table(data, False)

### Research Subject Distribution of the Repositories in the ACM Sample
Most of the repositories referenced by ACM publications are assigned to the physical sciences (86.1%), far ahead of the social (6.2%) and life (5.6%) sciences.

In [None]:
overview = rs_repo_table.compose_subjects('main_subject','acm')
data, data_cumulated = cursor_to_dict(overview)
display_pie_chart(data_cumulated, 'ACM', True)

# display_table(data, True)

### Research Field Distribution of the Repositories in the ACM Sample
The most strongly represented research field is computer science (69.6%). Repositories are less frequently assigned to engineering, mathematics, decision sciences, and environmental science.

In [None]:
overview = rs_repo_table.compose_subjects('subject', 'acm')
data, data_cumulated = cursor_to_dict(overview, 750)
display_pie_chart(data_cumulated, 'ACM', False)

display_table(data, False)

### Research Subject Distribution of the Repositories in the arXiv Sample
The vast majority of the repositories (95.6%) are referenced by arXiv publications assigned to the physical sciences.

In [None]:
overview = rs_repo_table.compose_subjects('main_subject', 'arxiv')
data, data_cumulated = cursor_to_dict(overview)
display_pie_chart(data_cumulated, 'arXiv', True)

display_table(data, True)

### Research Field Distribution of the Repositories in the arXiv Sample
The metadata of arXiv publications provide information about the primary research field. About two thirds of the repositories belong to computer science, followed by repositories from earth and planetary sciences (7.1%). The remaining quarter of the repositories is associated with various other research fields.

In [None]:
overview = rs_repo_table.compose_subjects('subject', 'arxiv')
data, data_cumulated = cursor_to_dict(overview, 300)
display_pie_chart(data_cumulated, 'arXiv', False)

# display_table(data, False)

### Refining Computer Science of the arXiv Sample
Within the computer science portion, the largest share of repositories is assigned to computer vision and pattern recognition (41.9%), and 20.3% belong to machine learning. Publications belonging to computation and language (12.6%) and to robotics (11.3%) also refer to GitHub repositories, while the remaining repositories are referenced from publications of various other disciplines in computer science, including software engineering.

In [None]:
overview = rs_repo_table.compose_arxiv_cs()
areas, areas_cumulated = cursor_to_dict(overview, 80) 

plt.pie([float(v) for v in areas_cumulated.values()], labels=[k for k in areas_cumulated.keys()],
        counterclock=False, autopct='%1.1f%%', startangle=90, pctdistance=0.8, 
        colors=plt.cm.Set3.colors)

plt.axis('equal')
#plt.savefig("arxivCS.pdf", bbox_inches = "tight")
print('arXiv group: refined computer science share:')
plt.show()