# Comparing discoverability of Digital Commons and Figshare

## Background
In this study, we compared the discoverability, measured by download counts, of two hosted IR solutions, Digital Commons and Figshare, using a randomized controlled experiment. We also explored patterns in how the records are indexed in Google Scholar. 

This case study was conducted using the institutional repository and the data repository of Singapore Management University (SMU), hosted on Digital Commons (https://ink.library.smu.edu.sg/) and Figshare (https://researchdata.smu.edu.sg/) respectively.


## Method

### Hypothesis Development
The main purpose of this study is to explore whether Digital Commons and Figshare differ in attracting downloads to deposited academic publications. We experimented with two randomly selected groups of full-text journal articles, one group uploaded to each platform at around the same time. The usage and download statistics of both groups were tracked over 7 months. We assume that other factors affecting downloads, such as article quality and the popularity of research topics, are randomized across the two groups. Therefore, the difference in download counts can serve as a reasonable approximation of the difference in platform discoverability between Digital Commons and Figshare. 

We established the following hypotheses: 

***H0: There is no difference in average paper downloads between Figshare and Digital Commons.***

***H1: The average paper downloads differ between Figshare and Digital Commons.***

The significance level is set to be 0.05.

### Data Collection

A total of 96 journal article records with full-text PDFs were exported from SMU's Current Research Information System (CRIS). Half of the records (48) were uploaded to InK, SMU’s IR hosted on Digital Commons, and the other half to SMU RDR on Figshare. 
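
The source does not describe how the 96 records were split, so the following is only a minimal sketch of one way such a random assignment could be made reproducibly. The record identifiers and the seed are hypothetical placeholders, not values from the study.

```python
import random

# Hypothetical record identifiers standing in for the 96 CRIS exports.
record_ids = [f"rec-{i:03d}" for i in range(1, 97)]

# A fixed seed (arbitrary, for illustration) makes the split reproducible.
rng = random.Random(2021)
shuffled = rng.sample(record_ids, k=len(record_ids))

# First half goes to InK (Digital Commons), second half to RDR (Figshare).
ink_group = shuffled[:48]
rdr_group = shuffled[48:]

print(len(ink_group), len(rdr_group))
```

Shuffling once and cutting the list in half guarantees two equal-sized, non-overlapping groups, which a per-record coin flip would not.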

The journal article metadata and full-text PDF were uploaded to both platforms towards the end of March 2021. Monthly download count statistics of each article from both platforms were collected from April to October 2021. 
During the course of the study, a few records were removed because of issues such as duplication or author request. The final dataset as of Dec 2021 contains 45 valid records from Digital Commons and 47 from Figshare. 

In addition to testing the proposed hypothesis, we were also interested in exploring how records from our repositories are indexed in Google Scholar. In late September, we checked the Google Scholar indexing of each record and added three fields to the dataset. We searched for each article by title and checked the following:

1) Whether the record is indexed in Google Scholar at all

2) If the record is in Google Scholar, whether our record is the only copy providing a unique PDF among the different versions of the same article

3) Whether our record is shown as the primary record in Google Scholar

The full dataset is available at https://doi.org/10.25440/smu.19121768. 


## Analysis & Results

In [None]:
# install the required libraries (the %pip magic targets the notebook's own environment)
%pip install pandas scipy

In [None]:
# import libraries
import pandas as pd
from scipy import stats

In [None]:
# import the csv data file
df_data = pd.read_csv('DCvsFigshare-downloads-combined-v1.csv')

In [None]:
# a preview of first 5 rows of the dataset
print(df_data.head(5))
df_data.dtypes


### Downloads comparison by platform

In [None]:
# summary statistics of total downloads by platform
df_data[["IR","Total"]].groupby("IR").describe()

# to check the means only: df_data.groupby("IR")["Total"].mean()

To see whether there is a statistically significant difference between the download counts of Figshare and Digital Commons, we conducted Welch's t-test (a two-sample t-test that does not assume equal variances) to compare the mean downloads for items in the two repositories. 

<b>Null Hypothesis H0</b>: There is no difference in average paper downloads between Figshare and Digital Commons. 

<b>Alternative Hypothesis H1</b>: The average paper downloads differ between Figshare and Digital Commons. 

The <b>significance level</b> is set to be 0.05. 

In [None]:
# subset total downloads by platform
dc = df_data.query('IR == "InK"')['Total']
fig = df_data.query('IR == "RDR"')['Total']

# equal_var=False selects Welch's t-test (unequal variances)
t_output = stats.ttest_ind(dc, fig, equal_var=False)
display(t_output)


The p-value is 0.596, which is greater than the significance level of 0.05. We therefore fail to reject the null hypothesis and conclude that there is no statistically significant difference in download counts between the two platforms. 
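
To make the test less of a black box, here is a minimal sketch of what Welch's t-test computes, written with the standard library only. The helper name `welch_t` and the download counts are made up for illustration and are not study data. Obtaining the p-value additionally requires the t-distribution CDF, which `scipy.stats.ttest_ind` handles.

```python
import math
import statistics

def welch_t(a, b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    # Squared standard error of each group mean: sample variance / n.
    se2_a = statistics.variance(a) / len(a)
    se2_b = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, dof

# Made-up download counts, for illustration only.
group_a = [12, 15, 9, 20, 14]
group_b = [10, 11, 8, 13, 30, 7]

t_stat, dof = welch_t(group_a, group_b)
print(f"t = {t_stat:.3f}, dof = {dof:.1f}")
```

Because the variances are not pooled, the degrees of freedom are generally not an integer and shrink when one group is much noisier than the other.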


### Additional analysis on Google Scholar availability

We carried out further exploratory analysis to see whether there are any interesting patterns depending on whether and how records are indexed by Google Scholar. 

#### 1. Download counts comparison between records indexed and not indexed by Google Scholar


In [None]:
from scipy import stats

# subset downloads from August to October (only data collected after the GS indexing issue for Figshare was fixed)
n_GS = df_data.query('GS_avail == 0')['AugToOct']
y_GS = df_data.query('GS_avail == 1')['AugToOct']

# perform Welch's t-test
ttest_GS = stats.ttest_ind(y_GS, n_GS, equal_var=False)
display(ttest_GS)
# perform Mann–Whitney U test (non-parametric) because the n_GS sample is small
utest_GS = stats.mannwhitneyu(y_GS, n_GS)
display(utest_GS)
# both tests show that the difference in downloads is not significant
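
As a rough illustration of why the Mann–Whitney U test suits small samples, the sketch below computes the U statistic by direct pairwise comparison using only the standard library. The helper name `mann_whitney_u` and the counts are made up for illustration. The statistic depends only on how values are ordered, not on their distribution.

```python
def mann_whitney_u(x, y):
    """Return U for sample x: the number of (x, y) pairs where x is larger, ties counting 0.5."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u

# Made-up download counts, for illustration only.
indexed = [5, 8, 12, 3, 9]
not_indexed = [4, 6, 2]

u_x = mann_whitney_u(indexed, not_indexed)
u_y = mann_whitney_u(not_indexed, indexed)
print(u_x, u_y)  # prints "12.0 3.0"; the two always sum to len(x) * len(y)
```

In recent SciPy versions, the statistic returned by `stats.mannwhitneyu(x, y)` should correspond to the first sample; the p-value is then derived from the distribution of U under the null hypothesis.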


#### 2. Download counts comparison between records that do and do not provide the unique PDF in Google Scholar (indexed records only)

In [None]:
# use a t-test for the comparison because N > 30
# subset total downloads by whether the record provides the unique PDF
n_uniq = df_data.query('uniq_PDF == 0')['Total']
y_uniq = df_data.query('uniq_PDF == 1')['Total']

df_data[["uniq_PDF","Total"]].groupby("uniq_PDF").describe()


In [None]:
ttest_uniq = stats.ttest_ind(y_uniq, n_uniq, equal_var=False)
display(ttest_uniq)


#### 3. Do records shown as the primary record on Google Scholar receive more downloads than those that are not? 

In [None]:
n_primary = df_data.query('primary == "FALSE"')['Total']
y_primary = df_data.query('primary == "TRUE"')['Total']
df_data[["primary","Total"]].groupby("primary").describe()


In [None]:
# use Welch's t-test for the comparison; N=22 is small but still acceptable
ttest_primary = stats.ttest_ind(y_primary, n_primary, equal_var=False)
display(ttest_primary)
# the result is significant: primary records receive more downloads

In [None]:
# perform Mann–Whitney U test (non-parametric) because the n_primary sample is small
utest_primary = stats.mannwhitneyu(y_primary, n_primary)
display(utest_primary)

Both tests indicate that the difference in downloads between primary and non-primary records is statistically significant. 