# Crossref: number of published articles

This notebook contains **Julia** code that counts the number of articles each publisher released per year. It requires the pre-processed Crossref tabulated files to be available in the *data/crossref* folder. To generate the .tab files from the publicly available Crossref dataset, see *crossref to tab.ipynb*.

The script will produce two kinds of output:

(a) a separate .txt file for every publisher in Crossref, containing the total number of articles published before 2001 and the number published in each year from 2001 onwards (articles dated after 2024 are counted under 2024). These data are used to create the figures in the *sci-hub percent per publisher.ipynb* notebook.

(b) the *crossref date count.tab* file, containing, for every date since 2003 (when the Crossref index was established), the number of DOIs created in Crossref on that date. Only DOIs of type *journal-article* and *proceedings-article* are counted. The data are used in the *sci-hub db growth.ipynb* notebook.
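As a rough illustration of output (a), the sketch below parses one per-publisher file, assuming the layout written by the last cell of this notebook: the publisher name on the first line, followed by one tab-separated year and count per line. The helper name `read_publisher` and the sample values are hypothetical and are not taken from the real data.

```julia
# Hypothetical helper: parse one per-publisher file into (name, year => count).
function read_publisher(io::IO)
    name   = readline(io)                  # first line holds the publisher name
    counts = Dict{Int,Int}()
    for line in eachline(io)
        isempty(strip(line)) && continue   # tolerate a trailing blank line
        year, n = split(line, '\t')
        counts[parse(Int, year)] = parse(Int, n)
    end
    return name, counts
end

# Toy input with made-up numbers
sample = IOBuffer("Example Publisher\n2000\t120\n2001\t15\n2002\t18\n")
name, counts = read_publisher(sample)
```

In such a file the row for 2000 would hold the cumulative total for everything published up to and including 2000, not a single-year count.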

The data is prepared for the paper 'From black open access to open access of colour: accepting the diversity of approaches towards free science'.

In [1]:
# the script will process Crossref .tab files in 48 parallel processes

using Distributed

addprocs(48 - nprocs(); exeflags = `--project=$(Base.active_project())`) ;

In [2]:
@everywhere begin
    
    using MD5,
          Glob,
          Dates,
          Random,
          ProgressMeter,
          DelimitedFiles,
          DataStructures

    dir = @__DIR__ ;
    
end

In [3]:
# create a list of all .tab files with Crossref data available

tabs = glob("*.tab", dir * "/../data/crossref")

31601-element Vector{String}:
 "/media/alexandra/8 TB/Projects/" ⋯ 24 bytes ⋯ "ocessing/../data/crossref/0.tab"
 "/media/alexandra/8 TB/Projects/" ⋯ 24 bytes ⋯ "ocessing/../data/crossref/1.tab"
 "/media/alexandra/8 TB/Projects/" ⋯ 25 bytes ⋯ "cessing/../data/crossref/10.tab"
 "/media/alexandra/8 TB/Projects/" ⋯ 26 bytes ⋯ "essing/../data/crossref/100.tab"
 "/media/alexandra/8 TB/Projects/" ⋯ 27 bytes ⋯ "ssing/../data/crossref/1000.tab"
 "/media/alexandra/8 TB/Projects/" ⋯ 28 bytes ⋯ "sing/../data/crossref/10000.tab"
 "/media/alexandra/8 TB/Projects/" ⋯ 28 bytes ⋯ "sing/../data/crossref/10001.tab"
 "/media/alexandra/8 TB/Projects/" ⋯ 28 bytes ⋯ "sing/../data/crossref/10002.tab"
 "/media/alexandra/8 TB/Projects/" ⋯ 28 bytes ⋯ "sing/../data/crossref/10003.tab"
 ⋮

The following function takes the name of a .tab file as input and processes it, returning two dictionaries:

- number of articles published every year, for every publisher
- number of DOIs indexed every day

In [4]:
@everywhere function counts(tab,
        
                            subset = ["journal-article",
                                      "proceedings-article"] ,
        
                            y_start = 2000,
                            y_end   = 2024)
    
    data = readdlm(tab, '\t', String, '\n', quotes = false, use_mmap = true)

    date_count     = Dict()   # number of DOIs created on each day
    pub_year_count = Dict()   # number of articles published each year, per publisher

    # iterate through every DOI (record) in the tab file
    
    for record in eachrow(data)
        doi, t, created, published, pub = record
        year = parse(Int, published[1:4])

        # if DOI denotes article published in a journal or conference
        # increment the number of articles on the DOI creation date
        
        if t ∈ subset
            
            date_count[created] = get(date_count, created, 0) + 1
        end

        # skip if no publishing date or wrong one was specified
        
        (year < 1600)    && continue
        
        # years before y_start are folded into the y_start bin and
        # years after y_end into the y_end bin (totals, not yearly counts)
        
        (year < y_start) && (year = y_start)
        (year > y_end)   && (year = y_end)

        # initialize counter for publisher
        
        if (pub ∉ keys(pub_year_count))
            pub_year_count[pub] = Dict()
            for y in y_start:y_end
                pub_year_count[pub][y] = 0
            end
        end
        
        # increment the corresponding year counter
        
        pub_year_count[pub][year] += 1
        
    end

    (date_count, pub_year_count)
    
end
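The year-binning rule inside *counts* can be tried in isolation. The following is a minimal sketch with made-up years; the helper name `bin_year` is hypothetical, and `clamp` is used here as an equivalent of the two comparisons in the function above.

```julia
# Hypothetical helper mirroring the binning rule used in `counts`.
function bin_year(year; y_start = 2000, y_end = 2024)
    year < 1600 && return nothing         # treated as a missing or invalid date
    return clamp(year, y_start, y_end)    # fold early/late years into the edge bins
end

years = [1987, 2000, 2013, 2031, 0]       # made-up publication years
bins  = [bin_year(y) for y in years]

# Tally the binned years, skipping the records that were rejected.
tally = Dict{Int,Int}()
for b in bins
    b === nothing && continue
    tally[b] = get(tally, b, 0) + 1
end
```

Here 1987 and 2000 both land in the 2000 bin, 2031 lands in the 2024 bin, and the record with year 0 is skipped.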

The following function adds up the results returned by the *counts* function for separate .tab files:

In [5]:
@everywhere function plus(tab_a, tab_b)
    
    date_count,   pub_year_count   = tab_a
    date_count_b, pub_year_count_b = tab_b

    years(a, b) = merge(+, a, b)
    
    (merge(+,     date_count,     date_count_b),
     merge(years, pub_year_count, pub_year_count_b))
        
end
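To see what the nested merge does, here is a small sketch on toy dictionaries. The publisher names and numbers are invented, and `add_years` is a hypothetical stand-in for the inner `years` function.

```julia
# Two partial results, shaped like the per-publisher dictionaries above.
a = Dict("Pub A" => Dict(2000 => 5, 2001 => 2))
b = Dict("Pub A" => Dict(2000 => 1, 2001 => 4),
         "Pub B" => Dict(2000 => 7, 2001 => 0))

add_years(x, y) = merge(+, x, y)   # add two year => count dictionaries

# Publishers present in both inputs are combined with add_years;
# publishers present in only one input are copied over unchanged.
total = merge(add_years, a, b)
```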

Run the parallel processing of all .tab files:

In [6]:
shuffle!(tabs)

date_count, pub_year_count =

@showprogress output = stdout (
    @distributed plus ( for tab in tabs counts(tab) end )) ;

Progress: 100%|█████████████████████████████████████████| Time: 0:01:00
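The pattern used above is a parallel reduction: `@distributed` with a reducer splits the loop range across the available workers and folds the value of each iteration together with the reducer. Below is a minimal sketch of the same pattern on toy data; `toy_counts` and `toy_plus` are hypothetical stand-ins for `counts` and `plus`.

```julia
using Distributed

# Toy stand-in for `counts`: each "file" yields a small dictionary of tallies.
function toy_counts(i)
    key = isodd(i) ? "odd" : "even"
    return Dict(key => 1)
end

# Toy stand-in for `plus`: add two dictionaries key by key.
toy_plus(a, b) = merge(+, a, b)

# This runs as written on a single process. With extra workers added via
# addprocs, both functions would need to be defined under @everywhere.
result = @distributed toy_plus for i in 1:10
    toy_counts(i)
end
```

The reducer should be associative, since the order in which partial results are combined depends on how the range is split across workers.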


Preview calculation results for Elsevier:

In [7]:
pub_year_count["Elsevier BV"] |> SortedDict

SortedDict{Any, Any, Base.Order.ForwardOrdering} with 25 entries:
  2000 => 6504980
  2001 => 315887
  2002 => 313108
  2003 => 362274
  2004 => 404035
  2005 => 397714
  2006 => 442414
  2007 => 457491
  2008 => 478099
  2009 => 505701
  2010 => 501490
  2011 => 541953
  2012 => 571285
  2013 => 585170
  2014 => 609216
  2015 => 641024
  2016 => 659693
  2017 => 681003
  2018 => 704736
  2019 => 746609
  2020 => 795233
  2021 => 848953
  2022 => 941179
  2023 => 1014357
  2024 => 506362

Save the calculated counts of indexed DOIs by date:

In [8]:
writedlm(dir * "/../data/crossref date count.tab", date_count |> SortedDict)
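The resulting file has one row per date with two tab-separated columns. The sketch below writes and re-reads a toy version of it using only the standard library; the dates and counts are made up, and the ISO-style date strings are an assumption about the format of the `created` field.

```julia
using DelimitedFiles

# Made-up daily counts; ISO-style date strings are assumed here.
toy = Dict("2003-01-02" => 3, "2003-01-01" => 5)

path = tempname()

# Each Pair is written as one "date<TAB>count" row, sorted by date string.
writedlm(path, sort(collect(toy), by = first))

rows = readdlm(path, '\t', String)   # read everything back as strings
rm(path)
```

Sorting by the date string gives chronological order only if the strings are in a year-month-day layout.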

Store calculated counts of published articles per year for every publisher:

In [9]:
@showprogress output = stdout (
for (pub, years) in pub_year_count

    # the file name is the MD5 hash of the publisher name
    name = bytes2hex(md5(pub))

    open(dir * "/../data/publishers/" * name * ".txt", "w") do io
        println(io, pub)                  # first line: publisher name
        for (y, n) in SortedDict(years)   # then one "year<TAB>count" line per year
            println(io, y, "\t", n)
        end
    end

end )

Progress: 100%|█████████████████████████████████████████| Time: 0:00:04
