# Comparing downloads between Figshare and Digital Commons

## Data collection 

- Set up collections in Figshare and Digital Commons (DC).
- Take the records pending addition to DC, group them by school, and randomly assign them to a platform, controlling for papers from different schools.

## Method of analysis
- Summarize the analysis to be done.

In [8]:
#import libraries
import pandas as pd
from scipy import stats

#import the csv data file
df = pd.read_csv (r'C:\Users\dpdong\Documents\GitHub\2021-IR-compare-data_Git\DCvsFigshare-downloads-combined-v1.csv')

#preview the first 3 rows of the dataset, then show the column dtypes
print(df.head(3))
df.dtypes


                                          Identifier   IR  \
0  https://ink.library.smu.edu.sg/soss_research/3291  InK   
1  https://ink.library.smu.edu.sg/lkcsb_research/...  InK   
2  https://ink.library.smu.edu.sg/soss_research/3288  InK   

                                               Title  Apr  May  Jun  July  \
0  Humanist but not radical: The educational phil...    9   11  245    16   
1  Artificial intelligence as augmenting automati...   24   20   16    45   
2  Digitalising endangered cultural heritage in S...   32   32   32    22   

   Aug  Sep  Oct  Total  AugToOct  GS_avail uniq_PDF primary other comments  
0   23   24   26    354        73      True     True    TRUE            NaN  
1   47   39   66    257       152      True    False    TRUE            NaN  
2   22   57  102    299       181      True     True    TRUE            NaN  


Identifier        object
IR                object
Title             object
Apr                int64
May                int64
Jun                int64
July               int64
Aug                int64
Sep                int64
Oct                int64
Total              int64
AugToOct           int64
GS_avail            bool
uniq_PDF          object
primary           object
other comments    object
dtype: object

### Downloads comparison by platform

In [None]:
#summary stats
df[["IR","Total"]].groupby("IR").describe()

#code not used: df.groupby("IR").mean()

To test whether download counts differ significantly between Figshare and Digital Commons, we ran a two-sample t-test that does not assume equal variances (Welch's t-test), comparing the mean downloads per item in the two repositories. 

<b>Null Hypothesis H0</b>: There is no difference in average paper downloads between Figshare and Digital Commons. 

<b>Alternative Hypothesis H1</b>: Average paper downloads differ between Figshare and Digital Commons. 

The <b>significance level</b> is set to 0.05. 

In [None]:
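#subset total downloads by platform: "InK" is Digital Commons, "RDR" is Figshare
dc = df.query('IR == "InK"')['Total']
fig = df.query('IR == "RDR"')['Total']

#Welch's t-test (equal_var=False does not assume equal variances)
t_output = stats.ttest_ind(dc, fig, equal_var=False)
display(t_output)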


The p-value is 0.596, which is above the 0.05 significance level. We therefore fail to reject the null hypothesis: the data show no statistically significant difference in download counts between the two platforms. 
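The decision rule used here (compare the p-value from Welch's t-test against the 0.05 significance level) can be sketched as a small helper. This is a minimal sketch on made-up numbers, not the study data; `welch_decision` and the sample values are hypothetical names introduced only for illustration.

```python
import numpy as np
from scipy import stats

ALPHA = 0.05  # significance level stated in this notebook


def welch_decision(sample_a, sample_b, alpha=ALPHA):
    """Run Welch's t-test and report whether H0 is rejected at `alpha`."""
    result = stats.ttest_ind(sample_a, sample_b, equal_var=False)
    return {
        "statistic": float(result.statistic),
        "pvalue": float(result.pvalue),
        "reject_h0": bool(result.pvalue < alpha),
    }


# Synthetic, made-up download counts (NOT the study data)
rng = np.random.default_rng(0)
group_a = rng.poisson(lam=40, size=50)
group_b = rng.poisson(lam=40, size=50)

decision = welch_decision(group_a, group_b)
print(decision)
```

A p-value at or above `alpha` means the test fails to reject H0; it does not show that the two groups are equal.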


### Additional analysis on Google Scholar availability

Further exploratory analysis was done to see whether there are any interesting patterns depending on whether and how records are indexed by Google Scholar (GS). 

#### 1. Download counts comparison between records indexed and not indexed by Google Scholar


In [25]:
from scipy.stats import mannwhitneyu

#normality is not checked in this cell (a Shapiro-Wilk test could be used); a non-parametric test is run alongside the t-test instead
#subset downloads from August to October only (data after the GS indexing issue for Figshare was fixed)
n_GS = df.query('GS_avail == 0')['AugToOct']  #not indexed by Google Scholar
y_GS = df.query('GS_avail == 1')['AugToOct']  #indexed by Google Scholar

#perform Welch's t-test
ttest_GS = stats.ttest_ind(y_GS, n_GS, equal_var=False)
display(ttest_GS)
#perform Mann–Whitney U test (non-parametric test) because the n_GS sample is small
utest_GS = stats.mannwhitneyu(y_GS, n_GS)
display(utest_GS)
#both tests show that the difference in downloads is not significant


Ttest_indResult(statistic=0.9990946794329939, pvalue=0.3218242664556237)

MannwhitneyuResult(statistic=581.5, pvalue=0.9704388586933024)
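A Shapiro-Wilk normality check, which the cell above mentions, could be sketched as follows. This is only an illustration on synthetic values, assuming a 0.05 threshold; `looks_normal` and the generated samples are hypothetical and are not part of the original analysis.

```python
import numpy as np
from scipy import stats


def looks_normal(sample, alpha=0.05):
    """Shapiro-Wilk test: True if normality is NOT rejected at `alpha`."""
    statistic, pvalue = stats.shapiro(sample)
    return bool(pvalue >= alpha)


# Synthetic right-skewed values, loosely resembling download counts (NOT the study data)
rng = np.random.default_rng(42)
skewed = rng.lognormal(mean=3.0, sigma=1.0, size=60)

print("raw sample looks normal:", looks_normal(skewed))
print("log-transformed sample looks normal:", looks_normal(np.log1p(skewed)))
```

When normality is rejected for a small group, a rank-based test such as Mann–Whitney U is a common fallback, which is the approach taken in this notebook.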

#### 2. Download counts comparison between records that do and do not provide a unique PDF in Google Scholar (among records indexed by GS)

In [31]:
#use t-test to compare, as both groups have N > 30
#subsetting data
n_uniq = df.query('uniq_PDF == 0')['Total']  #no unique PDF in Google Scholar
y_uniq = df.query('uniq_PDF == 1')['Total']  #unique PDF in Google Scholar

df[["uniq_PDF","Total"]].groupby("uniq_PDF").describe()


Summary statistics of `Total` by `uniq_PDF`:

| uniq_PDF | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| False | 43.0 | 31.395349 | 47.7863 | 0.0 | 6.0 | 16.0 | 35.5 | 257.0 |
| True | 34.0 | 98.794118 | 173.499416 | 4.0 | 16.75 | 40.0 | 90.75 | 934.0 |


In [29]:
ttest_uniq = stats.ttest_ind (y_uniq, n_uniq, equal_var=False)
display(ttest_uniq)

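Ttest_indResult(statistic=2.2001087678993465, pvalue=0.03412649543963407)

The p-value (0.034) is below the 0.05 significance level, so the difference is statistically significant: records with a unique PDF in Google Scholar received more downloads on average. 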


#### 3. Do records that are the primary record on Google Scholar receive more downloads than those that are not? 

In [42]:
#subsetting data; records marked "not sure" are left out of the comparison
n_primary = df.query('primary == "FALSE"')['Total']
y_primary = df.query('primary == "TRUE"')['Total']
df[["primary","Total"]].groupby("primary").describe()


Summary statistics of `Total` by `primary`:

| primary | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| FALSE | 22.0 | 12.090909 | 11.363723 | 0.0 | 4.0 | 7.5 | 21.25 | 42.0 |
| TRUE | 52.0 | 82.461538 | 146.563573 | 1.0 | 13.75 | 37.5 | 79.5 | 934.0 |
| not sure | 3.0 | 51.666667 | 15.044379 | 36.0 | 44.5 | 53.0 | 59.5 | 66.0 |


In [43]:
#use Welch's t-test to compare; the smaller group (n_primary) has N=22, which is small but still acceptable
ttest_primary = stats.ttest_ind(y_primary, n_primary, equal_var=False)
display(ttest_primary)
#result is significant - primary records receive more downloads 

Ttest_indResult(statistic=3.437979553995462, pvalue=0.001155883277678811)

In [44]:
#perform Mann–Whitney U test (non-parametric test) as the sample size for n_primary is considered small. 
utest_primary = stats.mannwhitneyu (y_primary, n_primary)
display(utest_primary)

MannwhitneyuResult(statistic=913.5, pvalue=5.465982925266728e-05)

Both tests indicate that the difference in downloads between primary and non-primary records is statistically significant. 
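Because the question in this subsection is directional (do primary records receive more downloads?), a one-sided Mann–Whitney U test is one possible variant of the test above. The sketch below uses SciPy's `alternative="greater"` option on made-up numbers; `one_sided_u_test`, `primary_like` and `non_primary_like` are hypothetical names, and the values are not taken from the dataset.

```python
from scipy import stats


def one_sided_u_test(higher, lower, alpha=0.05):
    """One-sided Mann-Whitney U test.

    H1: values in `higher` tend to be larger than values in `lower`.
    Returns (p-value, whether H0 is rejected at `alpha`).
    """
    result = stats.mannwhitneyu(higher, lower, alternative="greater")
    return float(result.pvalue), bool(result.pvalue < alpha)


# Made-up download counts for illustration only (NOT the study data)
primary_like = [30, 45, 52, 60, 75, 88, 120, 200]
non_primary_like = [1, 3, 4, 7, 9, 12, 15, 21]

pvalue, reject_h0 = one_sided_u_test(primary_like, non_primary_like)
print(f"p-value = {pvalue:.6f}, reject H0: {reject_h0}")
```

A one-sided test should be chosen before looking at the data; the two-sided default used in this notebook is the more conservative choice.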