Review and Replication Report for White et al (2020) submitted to PeerJ
=========================================================

Reviewing the revised copy, December 2020

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

In [2]:
paper_data = pd.read_excel('peerj-51535-peerj-51535-data-v2-11Dec2020.xlsx', 
                           sheet_name='Data')

In [3]:
paper_data

Unnamed: 0,DOI,Evidence,Licence,OA Status,Title,Authors,Author count,Author count>20,Journal,Year,...,"Ministry of Business, Innovation and Employment",NZ Govt,National Institute for Health Research,National Institutes of Health,National Science Foundation (US),US Government,UK Government,Australian Government,Australian Government.1,NZG Check
0,10.1001/jama.2017.7219,open (via free pdf),,bronze,Effect of Robotic-Assisted vs Conventional Lap...,"Jayne, David; Pigazzi, Alessio; Marshall, Hele...",16.0,No,JAMA,2017,...,False,False,True,False,False,False,True,False,False,True
1,10.1001/jamacardio.2017.0175,oa repository (via OAI-PMH doi match),,green,Effect of Monthly High-Dose Vitamin D Suppleme...,"Scragg, Robert; Stewart, Alistair W.; Waayer, ...",9.0,No,JAMA Cardiology,2017,...,False,True,False,False,False,False,False,False,False,True
2,10.1001/jamacardio.2017.2941,,,closed,Vitamin D Supplementation and Cardiovascular D...,"Scragg, Robert; Camargo, Carlos A.",2.0,No,JAMA Cardiology,2017,...,False,True,False,False,False,False,False,False,False,True
3,10.1001/jamapediatrics.2017.1579,open (via free pdf),,bronze,Association of Neonatal Glycemia With Neurodev...,"Mckinlay, Christopher J. D.; Alsweiler, Jane M...",15.0,No,JAMA Pediatrics,2017,...,False,True,False,False,False,False,False,False,False,True
4,10.1001/jamapsychiatry.2016.4234,open (via free pdf),,bronze,Paternal Depression Symptoms During Pregnancy ...,"Underwood, Lisa; Waldie, Karen E.; Peterson, E...",7.0,No,JAMA Psychiatry,2017,...,True,True,False,False,False,False,False,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12925,10.19153/cleiej.20.1.2,oa journal (via doaj),cc-by,diamond,Comparison of Two Forced Alignment Systems for...,"Flores Sol\\U00F3Rzano, Sof\\U00Eda; Coto-Sola...",2.0,No,CLEI electronic journal,2017,...,False,False,False,False,False,False,False,False,False,True
12926,10.1037/xge0000268,oa repository (via pmcid lookup),,green,"Once a frog-lover, always a frog-lover?: Infan...","Martin, Alia; Shelton, Catharyn C.; Sommervill...",3.0,No,Journal of Experimental Psychology: General,2017,...,False,False,False,False,False,False,False,False,False,True
12927,10.1177/0961463X17701955,,,closed,The tyranny of clock time? Debating fatigue in...,"Snyder, Benjamin H",1.0,No,Time & Society,2019,...,False,False,False,False,True,True,False,False,False,True
12928,10.1080/13676261.2017.1316363,,,closed,"Youth studies, citizenship and transitions: to...","Wood, Bronwyn Elisabeth",1.0,No,Journal of Youth Studies,2017,...,False,False,False,False,False,False,False,False,False,True


Some basic analysis and confirmation of the results from the spreadsheet
--------------------------------------------------------------------------------------

As noted in the revised version, there are 12,930 rows in the new sheet. According to the revised article, the authors then filter to articles with fewer than 21 authors (12,302) that are also journal articles (12,016). A subset of these have NZ corresponding authors (5,301).

In [4]:
paper_data[paper_data['Author count'] < 21].DOI.nunique()

12226

This still does not match the reported 12,302. It appears that some entries in the 'Author count' column are blank, and that these were counted as having fewer than 21 authors.

In [5]:
paper_data[paper_data['Author count>20'] == 'No'].DOI.nunique()

12302

In [6]:
paper_data[(paper_data['Author count>20'] == 'No') &
            (paper_data['Author count'].isnull())].DOI.nunique()

75

In [7]:
paper_data[(paper_data['Author count'].isnull())].DOI.nunique()

82

In [8]:
paper_data[(paper_data['Author count'] <21) |
            (paper_data['Author count'].isnull())].DOI.nunique()

12308
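One way to pin down this kind of mismatch is to cross-tabulate the numeric column against the flag column and list the rows where the two disagree. The following is a minimal sketch on a made-up toy frame: the values are invented for illustration and are not taken from the spreadsheet. Only the column names 'Author count' and 'Author count>20' come from the data.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only (not from the spreadsheet)
toy = pd.DataFrame({
    'DOI': ['a', 'b', 'c', 'd', 'e'],
    'Author count': [3, 25, np.nan, np.nan, 21],
    'Author count>20': ['No', 'Yes', 'No', 'Yes', 'No'],
})

# Classify the numeric column, keeping blanks as their own category
numeric_class = np.where(toy['Author count'].isnull(), 'blank',
                         np.where(toy['Author count'] < 21, '<21', '>20'))

# Number of rows for each combination of numeric class and flag
check = pd.crosstab(pd.Series(numeric_class, name='numeric'),
                    toy['Author count>20'])

# Rows where a non-blank count contradicts the flag
inconsistent = toy[((toy['Author count'] > 20) & (toy['Author count>20'] == 'No')) |
                   ((toy['Author count'] < 21) & (toy['Author count>20'] == 'Yes'))]
```

Applied to the real sheet, the 'blank' row of `check` would show how the blank author counts are split across the flag values, and `inconsistent` would list any rows where a recorded count contradicts the flag.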

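I will set up paper_data using the 'Author count>20' column as presented in the data, although the counts above still do not fully reconcile: 82 DOIs have a blank author count, but only 75 of them are flagged 'No'.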

In [9]:
paper_data = paper_data[(paper_data['Author count>20'] == 'No') &
                        (paper_data['Genre'] == 'journal-article') ]

Table 3 gives figures for closed and open as a whole for all articles in the dataset. The numbers are confirmed as the same.

Table 5 provides average citation counts for this set of data. The numbers are confirmed as identical.

In [10]:
ppn_by_oa = paper_data.groupby('OA Status').agg(
    counts = pd.NamedAgg(column='DOI', aggfunc='count'),
    avg_citations = pd.NamedAgg(column='Crossref citations', aggfunc='mean'))
ppn_by_oa['percent'] = ppn_by_oa.counts / ppn_by_oa.counts.sum() * 100
ppn_by_oa

OA Status,counts,avg_citations,percent
bronze,1101,5.106267,9.162783
closed,7056,4.526077,58.721704
diamond,265,1.788679,2.205393
gold,1706,5.135991,14.197736
green,1259,7.521843,10.477696
hybrid,629,7.939587,5.234687


In [11]:
# Every article in the table has an entry for OA type
ppn_by_oa.counts.sum()

12016

Table 4 gives the same figures (closed vs open) for articles with an NZ corresponding author. The numbers are confirmed as the same as those given. Table 6 gives the division by OA type and average citations. Those numbers are also confirmed as the same.

In [12]:
ppn_by_oa = paper_data[paper_data['NZ Corresponding author'] == 'Yes'].groupby('OA Status').agg(
    counts = pd.NamedAgg(column='DOI', aggfunc='count'),
    avg_citations = pd.NamedAgg(column='Crossref citations', aggfunc='mean'))
ppn_by_oa['percent'] = ppn_by_oa.counts / ppn_by_oa.counts.sum() * 100
ppn_by_oa

OA Status,counts,avg_citations,percent
bronze,423,4.926714,7.979626
closed,3502,3.687893,66.063007
diamond,95,1.452632,1.792115
gold,697,4.737446,13.148463
green,432,5.12963,8.149406
hybrid,152,6.842105,2.867384


In [13]:
# Every article in the table has an entry for OA type
ppn_by_oa.counts.sum()

5301

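Confirm the figures as a percentage of the open articles. The 'closed' row of percent_of_open is not meaningful, because closed articles are excluded from the denominator.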

In [14]:
ppn_by_oa['percent_of_open'] = ppn_by_oa.counts / (ppn_by_oa.counts.sum() - ppn_by_oa.loc['closed', 'counts']) * 100
ppn_by_oa

OA Status,counts,avg_citations,percent,percent_of_open
bronze,423,4.926714,7.979626,23.513063
closed,3502,3.687893,66.063007,194.663702
diamond,95,1.452632,1.792115,5.280712
gold,697,4.737446,13.148463,38.743747
green,432,5.12963,8.149406,24.013341
hybrid,152,6.842105,2.867384,8.449138
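A slightly cleaner way to get the share of each open category is to drop the closed row before dividing, so that the percentages sum to 100. A minimal sketch, with the counts copied from the NZ corresponding author table above (the variable names are my own):

```python
import pandas as pd

# Counts copied from the NZ corresponding author table above
counts = pd.Series({'bronze': 423, 'closed': 3502, 'diamond': 95,
                    'gold': 697, 'green': 432, 'hybrid': 152}, name='counts')

# Exclude closed articles, then express each open category
# as a percentage of all open articles
open_counts = counts.drop('closed')
percent_of_open = open_counts / open_counts.sum() * 100
```

This gives the same values as the percent_of_open column for the open categories, without producing a value above 100 for closed.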


Repeat the analysis for those cases where there is an NZ corresponding author

Repeat for NZ reprint Authors

Calculating APCs
---------------------

The paper only reports on APCs for those articles with a NZ author. It might be interesting to look
at various implementations of the CAUL approach for APC calculation to see how much difference that
makes.

Confirm the same numbers as the paper.

In [15]:
apcs_by_oa = paper_data[paper_data['NZ Corresponding author'] == 'Yes'].groupby('OA Status').agg(
              counts = pd.NamedAgg(column='DOI', aggfunc='count'),
              av_cites = pd.NamedAgg(column='Crossref citations', aggfunc='mean'),
              known_apcs = pd.NamedAgg(column='USD APC', aggfunc='count'),
              total_apcs = pd.NamedAgg(column='USD APC', aggfunc='sum'),
              av_apcs = pd.NamedAgg(column='USD APC', aggfunc='mean'))
apcs_by_oa.loc[['gold', 'hybrid']]

OA Status,counts,av_cites,known_apcs,total_apcs,av_apcs
gold,697,4.737446,697,1172029.0,1681.533716
hybrid,152,6.842105,110,281378.0,2557.981818
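Note that only 110 of the 152 hybrid articles have a known APC. As an illustration of how much the treatment of unknown APCs could matter, the sketch below extrapolates the mean known APC to the articles without one. This is a simple assumption of my own for illustration; it is neither the method used in the paper nor necessarily the CAUL approach. The figures are copied from the table above.

```python
import pandas as pd

# Figures copied from the gold/hybrid table above
apcs = pd.DataFrame({'counts': [697, 152],
                     'known_apcs': [697, 110],
                     'total_apcs': [1172029.0, 281378.0]},
                    index=['gold', 'hybrid'])

# Assumption (mine): articles with no known APC cost the mean known APC
apcs['mean_known_apc'] = apcs.total_apcs / apcs.known_apcs
apcs['unknown'] = apcs.counts - apcs.known_apcs
apcs['estimated_total'] = apcs.total_apcs + apcs.unknown * apcs.mean_known_apc
```

Under this assumption the gold total is unchanged, since every gold article has a known APC, while the hybrid total rises in proportion to the 42 articles with no recorded APC.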


Embargo Periods
---------------------

In [16]:
embargoes = paper_data[(paper_data['NZ Corresponding author'] == 'Yes') &
                       (paper_data['OA Status'] == 'closed')].groupby('Archive accepted manuscript').agg(
    counts = pd.NamedAgg(column='DOI', aggfunc='count'))
embargoes['percent'] = embargoes.counts / embargoes.counts.sum() * 100
embargoes.sort_index()

Archive accepted manuscript,counts,percent
12 months embargo,2115,60.394061
18 months embargo,318,9.080525
2 years embargo,171,4.882924
24 months embargo,41,1.17076
3 months embargo,3,0.085665
36 months embargo,1,0.028555
4 months embargo,1,0.028555
6 months embargo,73,2.084523
No information,128,3.655054
Permission must be obtained from the publisher,1,0.028555
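The labels mix units: '2 years embargo' and '24 months embargo' describe the same period but are counted separately. A minimal sketch of normalising the labels to months before aggregating is given below. The helper `embargo_months` is hypothetical and not part of the original analysis; the counts are copied from the table above.

```python
import re
import pandas as pd

# Counts copied from the embargo table above
embargoes = pd.Series({
    '12 months embargo': 2115, '18 months embargo': 318,
    '2 years embargo': 171, '24 months embargo': 41,
    '3 months embargo': 3, '36 months embargo': 1,
    '4 months embargo': 1, '6 months embargo': 73,
    'No information': 128,
    'Permission must be obtained from the publisher': 1,
})

def embargo_months(label):
    """Hypothetical helper: embargo length in months, or None if not parsable."""
    m = re.match(r'(\d+)\s+(month|year)s?\s+embargo', label)
    if m is None:
        return None
    n = int(m.group(1))
    return n * 12 if m.group(2) == 'year' else n

months = [embargo_months(label) for label in embargoes.index]
table = pd.DataFrame({'counts': embargoes.values, 'months': months},
                     index=embargoes.index)

# Labels without a parsable duration (NaN months) are dropped by groupby
by_months = table.groupby('months')['counts'].sum()
```

With the labels normalised, the two-year and 24-month rows are combined into a single 24-month category.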


Looking at Funders
-----------------------

In the revised paper the funder data has been pre-processed into normalised names. I can't check this directly but may be able to do a conceptual reproduction below. Here I will simply aim to reproduce the numbers in Tables 9/10.

In [17]:
# Split the filtered data by whether the corresponding author is NZ-based
nz_paper_data = paper_data[paper_data['NZ Corresponding author'] == 'Yes']
non_nz_paper_data = paper_data[paper_data['NZ Corresponding author'] == 'No']

def funder_summary(df, funders):
    """For each funder flag column, give the number of articles and the
    percentage of them in each OA category."""
    rows = []
    for funder in funders:
        # Articles where the boolean funder column is set
        f = df[df[funder] == True]
        num = len(f)
        d = dict(Funder = funder,
                 Number = num,
                 Closed = f[f['OA Status'] == 'closed'].DOI.count() / num * 100,
                 Bronze = f[f['OA Status'] == 'bronze'].DOI.count() / num * 100,
                 Gold = f[f['OA Status'] == 'gold'].DOI.count() / num * 100,
                 Diamond = f[f['OA Status'] == 'diamond'].DOI.count() / num * 100,
                 Hybrid = f[f['OA Status'] == 'hybrid'].DOI.count() / num * 100,
                 Green = f[f['OA Status'] == 'green'].DOI.count() / num * 100
                 )
        rows.append(d)
    return pd.DataFrame(rows)

In [18]:
funders = ["Marsden",
           "Rutherford Discovery Fellowship",
           "Royal Society of New Zealand",
           "Health Research Council of New Zealand",
           "Ministry of Business, Innovation and Employment"]
# Named all_articles to avoid shadowing the built-in all()
all_articles = funder_summary(paper_data, funders)
all_articles

Unnamed: 0,Funder,Number,Closed,Bronze,Gold,Diamond,Hybrid,Green
0,Marsden,505,53.663366,9.70297,12.673267,1.188119,4.950495,17.821782
1,Rutherford Discovery Fellowship,90,44.444444,8.888889,21.111111,0.0,13.333333,12.222222
2,Royal Society of New Zealand,714,54.201681,8.683473,14.42577,0.980392,6.022409,15.686275
3,Health Research Council of New Zealand,468,45.08547,13.461538,28.846154,0.641026,2.991453,8.974359
4,"Ministry of Business, Innovation and Employment",443,65.011287,7.223476,15.575621,0.902935,4.740406,6.546275


In [19]:
nz_corr = funder_summary(nz_paper_data, funders)
nz_corr

Unnamed: 0,Funder,Number,Closed,Bronze,Gold,Diamond,Hybrid,Green
0,Marsden,362,56.906077,9.944751,12.983425,1.104972,4.143646,14.917127
1,Rutherford Discovery Fellowship,68,51.470588,11.764706,13.235294,0.0,11.764706,11.764706
2,Royal Society of New Zealand,515,57.087379,8.932039,14.368932,0.970874,5.048544,13.592233
3,Health Research Council of New Zealand,356,45.786517,13.202247,29.494382,0.842697,2.808989,7.865169
4,"Ministry of Business, Innovation and Employment",303,65.346535,8.910891,16.50165,0.660066,2.640264,5.940594


In [20]:
non_nz = funder_summary(non_nz_paper_data, funders)
non_nz

Unnamed: 0,Funder,Number,Closed,Bronze,Gold,Diamond,Hybrid,Green
0,Marsden,143,45.454545,9.090909,11.888112,1.398601,6.993007,25.174825
1,Rutherford Discovery Fellowship,22,22.727273,0.0,45.454545,0.0,18.181818,13.636364
2,Royal Society of New Zealand,199,46.733668,8.040201,14.572864,1.005025,8.542714,21.105528
3,Health Research Council of New Zealand,112,42.857143,14.285714,26.785714,0.0,3.571429,12.5
4,"Ministry of Business, Innovation and Employment",140,64.285714,3.571429,13.571429,1.428571,9.285714,7.857143


In [21]:
funders = ["NZ Govt",
           "US Government",
           "Australian Government",
           "UK Government"]
# Named all_articles to avoid shadowing the built-in all()
all_articles = funder_summary(paper_data, funders)
all_articles

Unnamed: 0,Funder,Number,Closed,Bronze,Gold,Diamond,Hybrid,Green
0,NZ Govt,1519,54.904542,9.282423,18.564845,0.921659,4.937459,11.389072
1,US Government,273,23.809524,18.315018,19.413919,2.197802,14.652015,21.611722
2,Australian Government,358,38.547486,11.173184,23.184358,0.558659,6.424581,20.111732
3,UK Government,199,12.060302,16.582915,26.633166,0.502513,20.100503,24.120603


In [22]:
nz_corr = funder_summary(nz_paper_data, funders)
nz_corr

Unnamed: 0,Funder,Number,Closed,Bronze,Gold,Diamond,Hybrid,Green
0,NZ Govt,1098,56.466302,9.836066,18.852459,0.910747,3.825137,10.10929
1,US Government,52,32.692308,11.538462,17.307692,3.846154,13.461538,21.153846
2,Australian Government,68,45.588235,10.294118,22.058824,0.0,4.411765,17.647059
3,UK Government,33,33.333333,15.151515,15.151515,3.030303,12.121212,21.212121


In [23]:
non_nz = funder_summary(non_nz_paper_data, funders)
non_nz

Unnamed: 0,Funder,Number,Closed,Bronze,Gold,Diamond,Hybrid,Green
0,NZ Govt,421,50.831354,7.83848,17.814727,0.950119,7.83848,14.726841
1,US Government,221,21.719457,19.909502,19.909502,1.809955,14.932127,21.719457
2,Australian Government,290,36.896552,11.37931,23.448276,0.689655,6.896552,20.689655
3,UK Government,166,7.831325,16.86747,28.915663,0.0,21.686747,24.698795


Quick validation against COKI dataset
--------------------------------------------

For a further validation I pull data from an internal dataset for a quick comparison of OA 
and classes of OA for 2017 publications from the relevant universities. Query based on work
by Rebecca Handcock.

In [24]:
sql = """
WITH
  nz_dois AS (
  SELECT DISTINCT(doi)
  FROM  
    `academic-observatory.observatory.doi20201212`,
    UNNEST(affiliations.institutions) AS inst 

WHERE 
  crossref.published_year IN (2016,2017,2018) AND
  (
  "grid.16488.33" in (inst.identifier) OR
  "grid.9654.e" in (inst.identifier) OR
  "grid.148374.d" in (inst.identifier) OR
  "grid.29980.3a" in (inst.identifier) OR
  "grid.267827.e" in (inst.identifier) OR
  "grid.252547.3" in (inst.identifier) OR
  "grid.21006.35" in (inst.identifier) OR
  "grid.49481.30" in (inst.identifier)
  )
)

SELECT
  `academic-observatory.observatory.doi20201212`.doi as doi,
  
  crossref.published_year as year,
  unpaywall.is_oa as is_oa,
  unpaywall.gold_just_doaj as gold_doaj,
  unpaywall.green as green,
  unpaywall.green_only as green_only,
  unpaywall.hybrid as hybrid,
  unpaywall.bronze as bronze,
  open_citations.citations_total as citations,
  open_citations.citations_two_years as citations_2y
  
FROM
  `academic-observatory.observatory.doi20201212`,
  nz_dois

WHERE 
  `academic-observatory.observatory.doi20201212`.doi IN (nz_dois.doi)
"""

In [None]:
df = pd.read_gbq(sql, project_id='academic-observatory')

In [None]:
df

We get slightly fewer articles overall, which is not surprising as the local data collection should be more complete. This is a basic check on the levels of the different categories and on citation counts. Note that for this analysis the term 'gold_doaj' corresponds to the use of gold in the paper, and 'green_only' corresponds to the use of green in the paper.

Interestingly, this run gives a higher OA percentage than my previous replication. We also seem to capture much closer to the full number of articles in this set. It would be interesting to compare the overlap of DOIs.

In [None]:
df['closed'] = ~df.is_oa.astype(bool)
# The query returns 2016-2018; restrict to 2017 so that the counts and
# the percentages share the same denominator
df_2017 = df[df.year == 2017]
rows = []
for col in ['is_oa', 'gold_doaj', 'hybrid', 'bronze', 'green', 'green_only', 'closed']:
    subset = df_2017[df_2017[col] == True]
    d = dict(status = col,
             count = len(subset),
             percent = np.round(len(subset) / len(df_2017) * 100),
             citations = np.round(subset['citations'].mean(), decimals=2),
             citations2y = np.round(subset['citations_2y'].mean(), decimals=2)
            )
    rows.append(d)

new = pd.DataFrame(rows)
new


Comparison of DOI sets
----------------------------

There is a fairly significant difference between the two sets of DOIs. This is most likely related to differences in the publication date recorded by the different data sources. It makes direct comparison and merging of the datasets challenging, but does not strongly affect the validity of the conceptual replication.

A quick check suggests that the paper's dataset contains a number of DOIs that presumably relate to people who are, or were, affiliated with an NZ institution, where that affiliation is not listed against the article itself. This is again not particularly surprising.

In [None]:
# Normalise case before comparing; dropna() guards against missing DOIs
coki_dois = {d.upper() for d in df.doi.dropna()}
paper_dois = {d.upper() for d in paper_data.DOI.dropna()}
len(coki_dois & paper_dois)

In [None]:
len(paper_dois - coki_dois)

In [None]:
(paper_dois - coki_dois)
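The same comparison can be summarised in one step with an outer merge and the `indicator` option, which labels each DOI as present in the left set only, the right set only, or both. A minimal sketch on made-up DOIs (the frames `coki` and `paper` and their contents are invented for illustration, standing in for `df` and `paper_data`):

```python
import pandas as pd

# Toy stand-ins for the COKI query result and the paper's data sheet
coki = pd.DataFrame({'doi': ['10.1/AAA', '10.1/bbb', '10.1/ccc']})
paper = pd.DataFrame({'DOI': ['10.1/aaa', '10.1/BBB', '10.1/ddd', None]})

# Normalise case and drop missing or duplicate DOIs before merging
left = pd.DataFrame({'doi_norm': paper['DOI'].dropna().str.upper().unique()})
right = pd.DataFrame({'doi_norm': coki['doi'].dropna().str.upper().unique()})

# _merge is 'left_only' (paper only), 'right_only' (COKI only) or 'both'
merged = left.merge(right, on='doi_norm', how='outer', indicator=True)
overlap = merged['_merge'].value_counts()
```

Filtering `merged` on the `_merge` column then gives the DOIs unique to either source, which could be inspected by publication year to test the explanation offered above.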