# Compiling the dataset used for analyses within the paper

The methods below were used to compile a best-effort dataset of software records held in UK institutional repositories, together with descriptive variables for each repository. Improved software was later developed from lessons learned during the research, so the functions here should be considered deprecated in favour of that newer approach (LINK). They are documented to show how the dataset was derived. Many additions and amendments were made manually, as described below.

## Build main dataset of descriptives from CORE.ac.uk API

Generate the JSON objects for the descriptive data of each repository:

In [None]:
import requests
import json
from pprint import pprint
import pandas as pd

pd.options.display.max_columns = None

API_ENDPOINT = "https://api.core.ac.uk/v3/"

'''Functions to retrieve data from CORE API v3. Authorised using the ./apikey from core.ac.uk.
Based on examples provided by CORE at https://github.com/oacore/apiv3-webinar/
'''

def get_API_Key() -> str:
    '''Retrieve the API key from project root folder.'''
    with open("./apikey", "r") as apikey_file:
        api_key = apikey_file.readlines()[0].strip()
    return api_key

def get_core_providers_details(country_code, api_key) -> list:
    """Get the descriptive details of every CORE data provider in the given country."""
    results = base_query_api("search/data-providers", "location.countryCode:" + country_code, api_key)
    if results is None:
        # base_query_api prints the error and returns None when the request fails
        return []
    return list(results['results'])

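def strip_http(df_in: pd.DataFrame) -> pd.DataFrame:
    '''Remove the literal "http://" from the URL column (modifies df_in in place).'''
    df_in['URL'] = df_in['URL'].str.replace('http://', '', regex=False)

    return df_in


def strip_https(df_in: pd.DataFrame) -> pd.DataFrame:
    '''Remove the literal "https://" from the URL column (modifies df_in in place).'''
    df_in['URL'] = df_in['URL'].str.replace('https://', '', regex=False)

    return df_in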

def base_query_api(url_fragment: str, query: str, api_key: str, limit=300):
    '''POST a search query to the CORE API; return the parsed JSON, or None on a non-200 response.'''
    headers = {"Authorization": "Bearer " + api_key}
    payload = {"q": query, "limit": limit}
    response = requests.post(f"{API_ENDPOINT}{url_fragment}", data=json.dumps(payload), headers=headers)
    if response.status_code == 200:
        return response.json()
    print(f"Error code {response.status_code}, {response.content}")
    return None

# Read the key once and reuse it for the request
api_key = get_API_Key()
data = get_core_providers_details('GB', api_key)

[Optional] Print the data:

In [None]:
display(data)
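
`base_query_api` sends a single request with `limit=300`, so any providers beyond that limit would not be returned. The following is a minimal sketch of offset-based paging, not the method used for the paper. It assumes the search response carries a `totalHits` count alongside `results` and that the API accepts an `offset` parameter; both should be checked against the current CORE documentation. `collect_all_results` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real request so the sketch runs offline.

```python
from typing import Callable, Optional


def collect_all_results(fetch_page: Callable[[int, int], Optional[dict]],
                        page_size: int = 100) -> list:
    """Collect results page by page until totalHits is reached or a page is empty."""
    collected = []
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        # Stop on a failed request (None) or an empty page
        if not page or not page.get("results"):
            break
        collected.extend(page["results"])
        offset += len(page["results"])
        # Assumption: the response reports the total number of matches as "totalHits".
        # If the field is absent, the loop stops after the first page.
        if offset >= page.get("totalHits", 0):
            break
    return collected


def fake_fetch(offset: int, limit: int) -> dict:
    """Hypothetical stand-in for an API call: serves slices of seven fake records."""
    records = [{"id": i} for i in range(7)]
    return {"totalHits": len(records), "results": records[offset:offset + limit]}


all_records = collect_all_results(fake_fetch, page_size=3)
print(len(all_records))
```

In real use, `fetch_page` would wrap a request that passes `offset` and `limit` in the query payload.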

- Create a dataframe from this data
- Rename `oaiPmhUrl` to `URL` and `software` to `ris_software`
- Remove `http://` and `https://` from the URLs
- Set `URL` as the index column


In [None]:
# Reuse the records fetched in the first cell rather than querying the API again
df_all_provider_details = pd.DataFrame(data)
df_all_provider_details.rename(columns={'oaiPmhUrl': 'URL', 'software': 'ris_software'}, inplace=True)
strip_http(df_all_provider_details)
strip_https(df_all_provider_details)
df_all_provider_details.set_index(keys='URL', inplace=True)
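
As an illustration of the clean-up steps above, the two scheme-stripping calls can be combined into a single anchored regular expression. This is a sketch on made-up data: the `example.ac.uk` URLs and the `toy` frame are hypothetical, not taken from CORE.

```python
import pandas as pd

# Made-up records; the third has no OAI-PMH URL at all
toy = pd.DataFrame({
    "name": ["Repo A", "Repo B", "Repo C"],
    "URL": ["http://a.example.ac.uk/oai", "https://b.example.ac.uk/oai", None],
})

# Strip a leading http:// or https:// in one pass; missing values stay missing
toy["URL"] = toy["URL"].str.replace(r"^https?://", "", regex=True)
toy = toy.set_index("URL")
print(toy)
```

Anchoring the pattern with `^` removes only a leading scheme, so a URL that happens to contain `http://` later in the string is left intact.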

[Optional] Print the dataframe

## Additional data

**Russell_member** was added to the dataset manually, using a list of Russell Group institutions compiled from https://russellgroup.ac.uk/.


russell_members = [
        "University of Birmingham",
        "University of Bristol",
        "University of Cambridge",
        "Cardiff University",
        "Durham University",
        "University of Edinburgh",
        "University of Exeter",
        "University of Glasgow",
        "Imperial College London",
        "King's College London",
        "University of Leeds",
        "University of Liverpool",
        "London School of Economics & Political Science",
        "University of Manchester",
        "Newcastle University",
        "University of Nottingham",
        "University of Oxford",
        "Queen Mary, University of London",
        "Queen's University Belfast",
        "University of Sheffield",
        "University of Southampton",
        "University College London",
        "University of Warwick",
        "University of York",
    ]
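
In the paper's dataset this flag was filled in by hand. For illustration only, a membership flag could be derived with `Series.isin`, assuming the frame has a column holding the institution name. The `name` column, the shortened list, and the example rows below are assumptions. An exact-match test like this will miss any provider whose recorded name is not the institution's name, which is one reason manual checking may still be needed.

```python
import pandas as pd

# Shortened stand-in for the full russell_members list
russell_sample = ["University of Bristol", "Cardiff University"]

# Hypothetical provider rows; "name" is an assumed column
providers = pd.DataFrame({
    "name": ["University of Bristol", "Aberystwyth University", "Cardiff University"],
})

# True where the name appears in the list exactly, False otherwise
providers["Russell_member"] = providers["name"].isin(russell_sample)
print(providers)
```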

**RSE_group** was added manually from the data provided at https://github.com/socrse/rse-groups/blob/master/groups.toml.

**ris_software_enum** was derived from the `ris_software` value provided by CORE (the cell below builds it as the `ris_enum` column). The CORE data was incomplete, so missing values were filled in manually.


In [None]:
import numpy as np
df_all_provider_details['ris_software'] = df_all_provider_details['ris_software'].str.lower()

# one condition per known platform; where several match, the first in the list wins
ris_conditions = [
    (df_all_provider_details['ris_software'].str.contains('pure',na=False)),
    (df_all_provider_details['ris_software'].str.contains('eprints',na=False)),
    (df_all_provider_details['ris_software'].str.contains('dspace',na=False)),
    (df_all_provider_details['ris_software'].str.contains('worktribe',na=False)),
    (df_all_provider_details['ris_software'].str.contains('figshare',na=False)),
    (df_all_provider_details['ris_software'].str.contains('haplo',na=False)),
    (df_all_provider_details['ris_software'].str.contains('esploro',na=False)),
    ]

# create a list of the values we want to assign for each condition
values = ['pure', 'eprints', 'dspace', 'worktribe', 'figshare', 'haplo', 'esploro']

# add new column and use np.select to assign values to it using lists as arguments;
# rows with a missing or unrecognised ris_software get None (np.select defaults to 0 otherwise)
df_all_provider_details['ris_enum'] = np.select(ris_conditions, values, default=None)

# display updated DataFrame
display(df_all_provider_details)
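
To make the mapping easier to follow, here is a small self-contained sketch of the same `np.select` pattern on made-up values. The `software` series is hypothetical. Passing `default=None` gives rows that match no platform a `None` value instead of NumPy's default of `0`.

```python
import numpy as np
import pandas as pd

# Made-up software strings, lower-cased as in the cell above
software = pd.Series(["EPrints 3.4", "DSpace", None, "Custom platform"]).str.lower()

platforms = ["pure", "eprints", "dspace"]

# One boolean mask per platform; na=False treats missing values as "no match"
conditions = [software.str.contains(p, na=False) for p in platforms]

# The first matching condition wins; unmatched rows receive None
ris_enum = np.select(conditions, platforms, default=None)
print(list(ris_enum))
```

Because matching is by substring, list order matters: a string that contains two platform names is assigned the one that appears first in `platforms`.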


**Manual_Num_sw_records** was created by visiting each repository in a web browser, searching it for software records, and entering the number of results returned. See the paper appendix for the raw queries.

**Category** was filled in manually, depending on the search results (see paper).

## Additional repositories

As mentioned in the paper, additional repositories were found in other aggregation sites that are not CORE data providers.  Data for these was filled manually via browser searches.  The additional institutions were:  

https://research.aber.ac.uk  
https://uobrep.openrepository.com  
https://figshare.cardiffmet.ac.uk  
https://pure.hartpury.ac.uk  
https://researchportal.hw.ac.uk  
https://pure.uhi.ac.uk  
https://research.leedstrinity.ac.uk  
https://figshare.le.ac.uk  
https://repository.lboro.ac.uk  
https://nua.repository.guildhe.ac.uk  
https://research-portal.uws.ac.uk  
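
These rows were added by hand for the paper. As a sketch of how such entries could be appended programmatically, the example below builds a small frame indexed by scheme-less URL and concatenates it with the CORE-derived frame. The `in_core` column, the `core_df` example row, and the two-URL subset are assumptions for illustration. `str.removeprefix` requires Python 3.9 or later.

```python
import pandas as pd

# Hypothetical stand-in for the CORE-derived frame, indexed by scheme-less URL
core_df = pd.DataFrame(
    {"name": ["Example CORE provider"], "in_core": [True]},
    index=pd.Index(["repo.example.ac.uk/oai"], name="URL"),
)

# Two of the additional repositories listed above
extra_urls = ["https://research.aber.ac.uk", "https://figshare.le.ac.uk"]

# Index the extra rows the same way and mark them as not coming from CORE
extra_df = pd.DataFrame(
    {"in_core": False},
    index=pd.Index([u.removeprefix("https://") for u in extra_urls], name="URL"),
)

# Columns missing from the extra rows (here "name") are filled with NaN
combined = pd.concat([core_df, extra_df])
print(combined)
```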
